# Capstone Project - Car Accident Severity (Week 2 & 3)

##### Olatunde Awobuluyi

###  Data Science Specialization Capstone Project offered by IBM on Coursera


## Table of Contents

* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Exploratory Data Analysis](#Exploration) 

## Introduction: Business Problem <a name="introduction"></a>

Between **2016 and 2019**, police in Northern Ireland recorded **23731** injury road traffic collisions across **11 districts**. Each record indicates how severe the outcome of the collision was.

The severity of these outcomes ranges from **1 (Fatal injury collision)** to **3 (Slight injury collision)**, with **2 (Serious injury collision)** being the middle value.

During this period, **Belfast City** had the highest number of recorded incidents with **5872** cases, and the district of **Armagh City, Banbridge & Craigavon** had the second-highest with **2329** cases.

The dataset will be useful to the police of Northern Ireland as a whole; however, we want to explore and build a prediction model for Belfast City specifically.

We believe the outcome of this exercise will be very useful to the **Belfast City Council** in designing and constructing new roads or modifying existing roads when considering variables such as speed limits, one-way systems, the position of traffic lights, etc.

## Data <a name="data"></a>

The entire dataset that includes all **11** districts has **23731** entries as mentioned earlier. Records from 2016 to 2019 were joined together to create a <a href="https://app.box.com/s/yqr1w8rjlro77ieckqcwmfub8bdcpja0">richer data set</a>.

However, the following **10** fields each have either **20902** or **20903** missing values.

    a_jdet            20902
    a_jcont           20902
    a_pedhum          20903
    a_pedphys         20903
    a_light           20902
    a_weat            20902
    a_roadsc          20902
    a_speccs          20902
    a_chaz            20902
    a_scene           20902

**The names of these fields will be changed to something easier to read in the next section**

We won’t be dropping the rows with missing data because of the sheer amount that would be lost. Instead, the empty fields in each of these columns will be replaced with the **most frequent value (mode)** of that categorical column.

Please view the <a href="https://app.box.com/s/x734lsyrb4jkavxqtoxjvmkoqyg5odpo">data guide document</a> for a description of each of the aforementioned fields.

Dropping all rows with missing data also leaves the target variable **'a_type'** (collision severity) unbalanced. Please view the <a href="https://app.box.com/s/x734lsyrb4jkavxqtoxjvmkoqyg5odpo">data guide document</a> for a description of the **'a_type'** field.
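A minimal sketch of this mode-based replacement, run on a small toy frame rather than the real data. The values are invented for illustration, and the assumption that missing entries appear as whitespace-only strings is taken from how the raw file behaves later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only): blanks stand in for the missing codes
toy = pd.DataFrame({"a_light": ["2", " ", "2", "6", " "],
                    "a_weat": [" ", "1", "1", " ", "3"]})

# Treat whitespace-only cells as missing, then convert the codes to numbers
toy = toy.replace(r"^\s*$", np.nan, regex=True).apply(pd.to_numeric)

# mode() skips NaN; iloc[0] takes the most frequent value per column
# (the smallest one if several values tie)
modes = toy.mode().iloc[0]

# Fill each column's gaps with that column's own mode
imputed = toy.fillna(modes)
print(imputed)
```

Converting the blanks to `NaN` before taking the mode matters here: otherwise the blank itself would be the most frequent "value" in a column that is mostly empty.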

### Trimming the Data

Before any exploratory analysis is done, the original dataset comprising incidents for all districts will be trimmed down to entries for **Belfast City** only, and the trimmed dataset will be checked for balance.
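The trimming and balance check can be sketched as follows. This is only an illustration on invented records; the column names `a_District` and `a_type` and the district code `BELC` are the ones used in the raw dataset.

```python
import pandas as pd

# Toy records (illustrative only); column names follow the raw dataset
toy = pd.DataFrame({
    "a_District": ["BELC", "BELC", "NEMD", "BELC", "DCST", "BELC"],
    "a_type": [3, 3, 3, 2, 3, 3],
})

# Keep only the Belfast City rows
belfast = toy[toy["a_District"] == "BELC"]

# Share of each severity class; a heavily skewed split means the target is unbalanced
class_share = belfast["a_type"].value_counts(normalize=True)
print(class_share)
```

Using `normalize=True` reports proportions instead of raw counts, which makes the imbalance easier to compare between the full dataset and the Belfast subset.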


#### This data set was published under an Open Government Licence on the data.gov.uk website and is titled <a href="https://data.gov.uk/dataset/4ddb6259-b47c-44e6-ac91-6d95dd527d26/police-recorded-injury-road-traffic-collision-statistics-northern-ireland-2016">"Police Recorded Injury Road Traffic Collision Statistics Northern Ireland 2016"</a>

#### The datasets for 2017, 2018, and 2019 have been added to this data set to make it richer for analysis

In [1]:
# Importing all the necessary libraries for data analysis
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
filename = "/home/tunde/Documents/COURSERA-DATA-SCIENCE-SPECIALIZATION/4_APPLIED_DATA_SCIENCE_CAPSTONE/Data_Set/Police_Record/collision2016_to_2019.csv"


In [3]:
df = pd.read_csv(filename)


In [4]:
df.head()

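|  | a_year | a_ref | a_District | a_type | a_veh | a_cas | a_wkday | a_day | a_month | a_hour | ... | a_jdet | a_jcont | a_pedhum | a_pedphys | a_light | a_weat | a_roadsc | a_speccs | a_chaz | a_scene |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016 | 1 | NEMD | 3 | 1 | 1 | FRI | 1 | 1 | 1 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 2016 | 2 | DCST | 3 | 2 | 1 | FRI | 1 | 1 | 3 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2016 | 3 | ARND | 2 | 1 | 1 | FRI | 1 | 1 | 3 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 6.0 | 2.0 | 4.0 | 1.0 | 1.0 | 1.0 |
| 3 | 2016 | 4 | BELC | 3 | 2 | 2 | FRI | 1 | 1 | 3 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 2016 | 5 | BELC | 3 | 2 | 1 | FRI | 1 | 1 | 15 | ... |  |  |  |  |  |  |  |  |  |  |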


#### Column header names are changed to something more readable 

In [5]:
# Dictionary with new column headers
New_Column_Headers = {'a_year':'Year', 'a_ref':'Reference_No', 'a_District':'District', 'a_type':'Collision_Severity', 'a_veh':'Number_of_Vehicles', 'a_cas':'Number_of_Casualities', 'a_wkday':'Weekday_of_Collision', 'a_day':'Week_of_Collision', 'a_month':'Month_of_Collision', 'a_hour':'Hour_of_Collision', 'a_min':'Minute_of-Collision', 'a_gd1':'Location_Easting', 'a_gd2':'Location_Northing', 'a_ctype':'Carriageway_Type', 'a_speed':'Speed_Limit', 'a_jdet':'Junction_Detail', 'a_jcont':'Junction-Control', 'a_pedhum':'Pedestrian_Crossing_Human_Control', 'a_pedphys':'Pedestrian_Crossing_Physical_Control', 'a_light':'Light_Conditions', 'a_weat':'Weather_Conditions', 'a_roadsc':'Road_Surface_Conditions', 'a_speccs':'Special_Conditions_at_Site', 'a_chaz':'Carriageway_Harzard', 'a_scene':'Police_at_Scene'}
new_df = df.rename(columns=New_Column_Headers)
new_df            

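|  | Year | Reference_No | District | Collision_Severity | Number_of_Vehicles | Number_of_Casualities | Weekday_of_Collision | Week_of_Collision | Month_of_Collision | Hour_of_Collision | ... | Junction_Detail | Junction-Control | Pedestrian_Crossing_Human_Control | Pedestrian_Crossing_Physical_Control | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Harzard | Police_at_Scene |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016 | 1 | NEMD | 3 | 1 | 1 | FRI | 1 | 1 | 1 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 2016 | 2 | DCST | 3 | 2 | 1 | FRI | 1 | 1 | 3 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2016 | 3 | ARND | 2 | 1 | 1 | FRI | 1 | 1 | 3 | ... | 1 | 1 | 1 | 1 | 6 | 2 | 4 | 1 | 1 | 1 |
| 3 | 2016 | 4 | BELC | 3 | 2 | 2 | FRI | 1 | 1 | 3 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 2016 | 5 | BELC | 3 | 2 | 1 | FRI | 1 | 1 | 15 | ... |  |  |  |  |  |  |  |  |  |  |
| 5 | 2016 | 6 | DCST | 3 | 2 | 3 | FRI | 1 | 1 | 15 | ... |  |  |  |  |  |  |  |  |  |  |
| 6 | 2016 | 7 | BELC | 3 | 2 | 2 | FRI | 1 | 1 | 16 | ... |  |  |  |  |  |  |  |  |  |  |
| 7 | 2016 | 8 | BELC | 3 | 2 | 1 | SAT | 2 | 1 | 0 | ... |  |  |  |  |  |  |  |  |  |  |
| 8 | 2016 | 9 | BELC | 3 | 2 | 2 | SAT | 2 | 1 | 8 | ... |  |  |  |  |  |  |  |  |  |  |
| 9 | 2016 | 10 | ANTN | 2 | 3 | 7 | SAT | 2 | 1 | 8 | ... | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 |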


### ----------------------------------------------------------------------------------------------------------------------------------------------------------------

## Exploratory Data Analysis: <a name="Exploration"></a>

For exploratory data analysis, the **categorical target variable 'Collision_Severity'** mentioned earlier will be evaluated via its correlation with the independent variables **'Number_of_Vehicles'**, **'Speed_Limit'**, **'Junction_Detail'**, **'Junction-Control'**, **'Road_Surface_Conditions'**, **'Light_Conditions'**, **'Weather_Conditions'**, and so on.

### Checking the number of entries, columns, and the data type of each column in the data set
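One possible way to look at such correlations is sketched below on invented values. Spearman rank correlation is an assumption chosen here because the severity codes are ordered categories; it is not necessarily the measure used later in this project.

```python
import pandas as pd

# Toy data (illustrative only) using the renamed column headers
toy = pd.DataFrame({
    "Collision_Severity": [3, 3, 2, 3, 1, 2, 3, 3],
    "Number_of_Vehicles": [2, 1, 2, 2, 1, 3, 2, 1],
    "Speed_Limit": [30, 30, 60, 30, 70, 60, 40, 30],
})

# Spearman rank correlation of every feature with the target,
# dropping the target's correlation with itself
corr_with_target = (
    toy.corr(method="spearman")["Collision_Severity"]
    .drop("Collision_Severity")
    .sort_values()
)
print(corr_with_target)
```

Because severity is coded from 1 (fatal) to 3 (slight), a negative coefficient for a feature means that higher values of that feature go with more severe outcomes.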

In [6]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23731 entries, 0 to 23730
Data columns (total 25 columns):
Year                                    23731 non-null int64
Reference_No                            23731 non-null int64
District                                23731 non-null object
Collision_Severity                      23731 non-null int64
Number_of_Vehicles                      23731 non-null int64
Number_of_Casualities                   23731 non-null int64
Weekday_of_Collision                    23731 non-null object
Week_of_Collision                       23731 non-null int64
Month_of_Collision                      23731 non-null int64
Hour_of_Collision                       23731 non-null int64
Minute_of-Collision                     23731 non-null int64
Location_Easting                        23731 non-null int64
Location_Northing                       23731 non-null int64
Carriageway_Type                        23731 non-null int64
Speed_Limit                        

### Describing the data 

In [7]:
new_df.describe(include = "all")

|  | Year | Reference_No | District | Collision_Severity | Number_of_Vehicles | Number_of_Casualities | Weekday_of_Collision | Week_of_Collision | Month_of_Collision | Hour_of_Collision | ... | Junction_Detail | Junction-Control | Pedestrian_Crossing_Human_Control | Pedestrian_Crossing_Physical_Control | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Harzard | Police_at_Scene |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 23731.0 | 23731.0 | 23731 | 23731.0 | 23731.0 | 23731.0 | 23731 | 23731.0 | 23731.0 | 23731.0 | ... | 23731.0 | 23731.0 | 23731.0 | 23731.0 | 23731.0 | 23731.0 | 23731.0 | 23731.0 | 23731.0 | 23731.0 |
| unique |  |  | 11 |  |  |  | 7 |  |  |  | ... | 9.0 | 6.0 | 4.0 | 7.0 | 8.0 | 11.0 | 11.0 | 7.0 | 7.0 | 3.0 |
| top |  |  | BELC |  |  |  | FRI |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| freq |  |  | 5872 |  |  |  | 3931 |  |  |  | ... | 20902.0 | 20902.0 | 20903.0 | 20903.0 | 20902.0 | 20902.0 | 20902.0 | 20902.0 | 20902.0 | 20902.0 |
| mean | 2017.458303 | 2971.237917 |  | 2.87097 | 1.867304 | 1.532721 |  | 15.652817 | 6.647929 | 13.72361 | ... |  |  |  |  |  |  |  |  |  |  |
| std | 1.118618 | 1720.229069 |  | 0.36335 | 0.652304 | 1.018194 |  | 8.765668 | 3.481886 | 4.989497 | ... |  |  |  |  |  |  |  |  |  |  |
| min | 2016.0 | 1.0 |  | 1.0 | 1.0 | 1.0 |  | 1.0 | 1.0 | 0.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 25% | 2016.0 | 1484.0 |  | 3.0 | 1.0 | 1.0 |  | 8.0 | 4.0 | 10.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 50% | 2017.0 | 2967.0 |  | 3.0 | 2.0 | 1.0 |  | 16.0 | 7.0 | 14.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 75% | 2018.0 | 4450.0 |  | 3.0 | 2.0 | 2.0 |  | 23.0 | 10.0 | 17.0 | ... |  |  |  |  |  |  |  |  |  |  |


There are a lot of missing values in the last 10 columns: the most frequent entry in each of them is a blank, which occurs **20902** or **20903** times. For the sparsest of these columns this leaves only **23731 - 20903 = 2828** entries with a recorded value. We may have to drop these columns later.
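The arithmetic above can be checked per column with a short sketch. The frame below is a toy stand-in with invented values, and it assumes (as in the raw file) that missing entries are stored as whitespace-only strings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (illustrative values only)
toy = pd.DataFrame({
    "a_speed": [30, 30, 60, 40],
    "a_jdet": [" ", "1", " ", " "],
    "a_light": [" ", "6", " ", "2"],
})

# Turn whitespace-only cells into NaN so pandas recognises them as missing
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Missing entries per column, as a count and as a share of all rows
missing_count = cleaned.isnull().sum()
missing_share = missing_count / len(cleaned)

# Rows that would survive dropping every row with at least one missing value
complete_rows = len(cleaned.dropna())

print(missing_count)
print(missing_share)
print(complete_rows)
```

Reporting the share alongside the count makes it easier to judge whether a column is worth imputing or should be dropped.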

In [8]:
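# Blank (whitespace-only) cells are converted to NaN so they are counted as missing.
# Note: this starts again from the raw df, so the original a_* column names are shown below.
new_df = df.replace(r'^\s*$', np.nan, regex=True)
new_df.isnull().sum(axis=0)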


a_year            0
a_ref             0
a_District        0
a_type            0
a_veh             0
a_cas             0
a_wkday           0
a_day             0
a_month           0
a_hour            0
a_min             0
a_gd1             0
a_gd2             0
a_ctype           0
a_speed           0
a_jdet        20902
a_jcont       20902
a_pedhum      20903
a_pedphys     20903
a_light       20902
a_weat        20902
a_roadsc      20902
a_speccs      20902
a_chaz        20902
a_scene       20902
dtype: int64

###### We have 11 unique districts for Northern Ireland as entries in the dataset. Here we can see that "BELC" (Belfast City) has the highest number of entries at 5872. We might want to observe how the speed limit "a_speed" affects collisions in the Belfast City district

In [None]:
df['a_District'].value_counts()

### 5. There are a lot of missing values in the data, as shown in the output below

In [None]:
df.isnull()

### 6. In the output below, we see that after dropping all rows that have "NaN" values, we have 754 entries for our training data. 

In [None]:
new_df = df.dropna()

new_df

### 7. The number of entries per district has reduced drastically

In [None]:
new_df['a_District'].value_counts()

In [None]:
new_df.info()

### 8. Describing the data 

In [None]:
new_df.describe(include = "all")

### 9. Checking how balanced the data is

### I won't be dropping the data. Instead, I will be replacing missing values with mode values. As you can see, dropping all missing values will remove all incidents that resulted in "slight injury (3)" from the dataset

In [None]:
new_df['a_type'].value_counts()

In [None]:
df['a_type'].value_counts()

### 10. A quick look at what the dataset for Belfast City would look like

In [None]:
bel_df = df[df.a_District == 'BELC']
bel_df


### 11. A quick look at the description of the dataset for Belfast City

In [None]:
bel_df.describe(include = "all")

### 12. Checking how balanced the dataset for Belfast City is

### Observations for collision severity are still unbalanced

In [None]:
bel_df['a_type'].value_counts()

### 13. Replacing the missing values





In [None]:
missing_data_cols = ["a_jdet", "a_jcont", "a_pedhum", "a_pedphys", "a_light", "a_weat", "a_roadsc", "a_speccs", "a_chaz", "a_scene"]

# Function to replace the missing values in the given columns with each column's mode
# and convert the data types in these columns from object data type to int64 data type

def replace_with_mode(dataframe, list_of_columns):

    for v in list_of_columns:

        # Convert first, so that blank strings become NaN and cannot be picked as the mode
        dataframe[v] = pd.to_numeric(dataframe[v], errors='coerce')

        # value_counts() ignores NaN, so idxmax() is the most frequent recorded value
        mode_value = dataframe[v].value_counts().idxmax()

        # Fill the gaps with the mode; with no NaN left, the codes can be stored as int64
        dataframe[v] = dataframe[v].fillna(mode_value).astype('int64')

    return dataframe

test = replace_with_mode(df, missing_data_cols)

print(test.dtypes)

### 14. A quick look at what the dataset for Belfast City would look like again

In [None]:
bel_df = df[df.a_District == 'BELC']
bel_df

In [None]:
bel_df.isnull().sum(axis=0)

In [None]:
sns.regplot(x='a_speed', y="a_type", data=df)

In [None]:
sns.residplot(x='a_speed', y='a_type', data=df)