# Traffic Crashes in Chicago

* Student name: Kinoti Martin Mwenda
* Student pace: Full-time
* Scheduled project review date/time: 26th May 2023
* Instructor name:
* Blog post URL:

# 1. Project Overview

The main aim of this project is to analyse the `Traffic Crashes` dataset for the city of Chicago, as directed by the government. Through this analysis, we aim to:

* Identify the areas within the city with the highest number of traffic accidents.
* Determine the main causes of the increase in traffic accidents in the city.
* Build a model that, given the identified contributing factors, predicts the likelihood of such incidents occurring.

## 1.1 Problem Statement

The City of Chicago's law enforcement has recorded an increase in traffic accidents on its roads. We are tasked with flagging the leading contributors to this increase and making recommendations to address them.

## 1.2 Objectives

#### Main Objective
To create an accurate model that predicts future trends in traffic accidents in the city of Chicago, given the contributing features.


#### Specific Objectives

* To determine how different features, such as weather condition and speed, influence the occurrence of traffic accidents.
* To recommend measures that can be taken to minimize traffic accidents.

## 1.3 Project Design

This project is broken down into the following stages:

* Data Exploration & Analysis
* Data Preprocessing
* Modelling
* Conclusions
* Recommendations

## 1.4 Data Feature Description

The features of this dataset that have been used in modelling include: 'crash_record_id', 'rd_no', 'crash_date_est_i', 'crash_date', 'posted_speed_limit', 'traffic_control_device', 'device_condition', 'weather_condition', 'lighting_condition', 'first_crash_type', 'trafficway_type', 'lane_cnt', 'alignment', 'roadway_surface_cond', 'road_defect', 'report_type', 'crash_type', 'intersection_related_i', 'not_right_of_way_i', 'hit_and_run_i', 'damage', 'date_police_notified', 'prim_contributory_cause', 'sec_contributory_cause', 'street_no', 'street_direction', 'street_name', 'beat_of_occurrence', 'photos_taken_i', and the following:

* statements_taken_i:
* dooring_i:
* work_zone_i:
* work_zone_type:
* workers_present_i:
* num_units:
* most_severe_injury, injuries_total, injuries_fatal, injuries_incapacitating, injuries_non_incapacitating, injuries_reported_not_evident, injuries_no_indication:
* injuries_unknown:
* crash_hour:
* crash_day_of_week:
* crash_month:
* latitude, longitude, location:

# 2. Data Exploration & Analysis

## 2.1 Data Cleaning & Exploration

In [1]:
# Relevant libraries
# Libraries for data handling
import numpy as np
import pandas as pd

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# import missingno as msno  # optional: visualizes missing-value patterns; not used yet
import warnings

# Libraries for modeling
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Library for statistics
import scipy.stats as stats

# Libraries for mapping
import geopandas as gpd  # geospatial data
from shapely.geometry import Point
import folium  # interactive Leaflet maps
from folium.plugins import FloatImage

# Plot styling ('seaborn' was renamed 'seaborn-v0_8' in matplotlib 3.6)
plt.style.use('seaborn-v0_8')
sns.set_style('whitegrid')

warnings.filterwarnings('ignore')




In [2]:
# Load the dataset and assign it to the variable "data" using pandas
data = pd.read_csv("Traffic_Crashes_-_Crashes.csv")

In [3]:
# Draw a reproducible random sample of 60,000 rows for analysis
data_sample = data.sample(n=60000, replace=False, random_state=123)
data_sample.reset_index(drop=True, inplace=True)
print(data_sample.shape)
data_sample.head()

(60000, 49)


| | CRASH_RECORD_ID | RD_NO | CRASH_DATE_EST_I | CRASH_DATE | POSTED_SPEED_LIMIT | TRAFFIC_CONTROL_DEVICE | DEVICE_CONDITION | WEATHER_CONDITION | LIGHTING_CONDITION | FIRST_CRASH_TYPE | ... | INJURIES_NON_INCAPACITATING | INJURIES_REPORTED_NOT_EVIDENT | INJURIES_NO_INDICATION | INJURIES_UNKNOWN | CRASH_HOUR | CRASH_DAY_OF_WEEK | CRASH_MONTH | LATITUDE | LONGITUDE | LOCATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18304bb6fdb5804a3effd1a598a78bf3dc9c07111befe4... | JA484827 | | 10/24/2017 11:15:00 PM | 25 | NO CONTROLS | NO CONTROLS | CLEAR | DARKNESS, LIGHTED ROAD | FIXED OBJECT | ... | 0.0 | 0.0 | 1.0 | 0.0 | 23 | 3 | 10 | 41.853074 | -87.618898 | POINT (-87.618897962591 41.853074038781) |
| 1 | 325b50af5f1b0f6d95622d524116d2cbd879bb0be9d93b... | JF461491 | | 11/04/2022 02:50:00 AM | 30 | TRAFFIC SIGNAL | FUNCTIONING PROPERLY | CLEAR | DARKNESS, LIGHTED ROAD | FIXED OBJECT | ... | 0.0 | 0.0 | 1.0 | 0.0 | 2 | 6 | 11 | 41.857849 | -87.616841 | POINT (-87.616840879395 41.857849087118) |
| 2 | bdfa523cdf7e77aa7cb9e58421f70435852dc4f536571f... | JC267785 | | 05/18/2019 05:20:00 AM | 30 | TRAFFIC SIGNAL | FUNCTIONING PROPERLY | CLEAR | DAWN | REAR END | ... | 0.0 | 0.0 | 3.0 | 0.0 | 5 | 7 | 5 | 41.857474 | -87.685969 | POINT (-87.685968604685 41.857474370968) |
| 3 | beb2bb00c87a0dacae6a7cf25018023c9337b8d5d6091b... | JC300323 | | 06/10/2019 01:05:00 PM | 30 | NO CONTROLS | NO CONTROLS | CLEAR | DAYLIGHT | ANGLE | ... | 0.0 | 0.0 | 2.0 | 0.0 | 13 | 2 | 6 | 41.706465 | -87.68177 | POINT (-87.681770004194 41.706464557469) |
| 4 | 8ecd0b725fa8e42df7e3be837c5c4c08741c9108d79ea7... | JE238494 | Y | 05/22/2021 01:00:00 AM | 20 | NO CONTROLS | NO CONTROLS | CLEAR | DARKNESS, LIGHTED ROAD | PARKED MOTOR VEHICLE | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 7 | 5 | 41.916086 | -87.805536 | POINT (-87.805535993338 41.916085623272) |
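The sample rows above suggest that `CRASH_HOUR`, `CRASH_DAY_OF_WEEK` and `CRASH_MONTH` are derived from `CRASH_DATE`, with days apparently numbered from Sunday = 1 (the 10/24/2017 crash, a Tuesday, is recorded as day 3). The following is a minimal sketch of how the three fields could be recomputed with pandas, assuming that numbering holds for the whole dataset. The `derive_time_features` helper is hypothetical and not part of the original notebook.

```python
import pandas as pd

def derive_time_features(crash_date: pd.Series) -> pd.DataFrame:
    """Recompute hour, day of week (Sunday = 1, an assumption) and month from crash_date strings."""
    parsed = pd.to_datetime(crash_date, format="%m/%d/%Y %I:%M:%S %p")
    return pd.DataFrame({
        "crash_hour": parsed.dt.hour,
        # pandas numbers days from Monday = 0; shift to the Sunday = 1 numbering seen in the sample
        "crash_day_of_week": (parsed.dt.dayofweek + 1) % 7 + 1,
        "crash_month": parsed.dt.month,
    })

# The five CRASH_DATE values from the sample output above
dates = pd.Series([
    "10/24/2017 11:15:00 PM",
    "11/04/2022 02:50:00 AM",
    "05/18/2019 05:20:00 AM",
    "06/10/2019 01:05:00 PM",
    "05/22/2021 01:00:00 AM",
])
time_features = derive_time_features(dates)
print(time_features)
```

Comparing the recomputed columns with the recorded ones on the full sample would be a cheap consistency check before using them as model features.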


In [4]:
# Change the column names to lower case, and briefly describe the data types
def info(data):
    """Lower-case the column names in place, then print dtypes, missing-value fractions and duplicate IDs."""
    # Convert column names to lower case
    data.columns = data.columns.str.lower()
    # Data type description (DataFrame.info() prints directly and returns None)
    print("Data Info")
    data.info()
    # Fraction of missing values per column, largest first
    print("--------------")
    print("Missing values")
    print(data.isna().mean().sort_values(ascending=False))
    # Number of repeated crash record IDs
    print("---------------------")
    print("Duplicates:", data["crash_record_id"].duplicated().sum())

In [5]:
# Display the data info
info(data_sample)

Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 49 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash_record_id                60000 non-null  object 
 1   rd_no                          59592 non-null  object 
 2   crash_date_est_i               4501 non-null   object 
 3   crash_date                     60000 non-null  object 
 4   posted_speed_limit             60000 non-null  int64  
 5   traffic_control_device         60000 non-null  object 
 6   device_condition               60000 non-null  object 
 7   weather_condition              60000 non-null  object 
 8   lighting_condition             60000 non-null  object 
 9   first_crash_type               60000 non-null  object 
 10  trafficway_type                60000 non-null  object 
 11  lane_cnt                       16560 non-null  float64
 12  alignment                      60000
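The non-null counts above translate directly into missing-value fractions: for example, `lane_cnt` has 16560 non-null entries out of 60000 rows. As a rough sketch, the same summary can be returned as a DataFrame rather than printed, which makes it easier to filter or plot. The `missing_summary` helper and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_fraction": df.isna().mean(),
    })
    return summary.sort_values("missing_fraction", ascending=False)

# Toy frame (made-up values) standing in for data_sample
toy = pd.DataFrame({
    "crash_record_id": ["a", "b", "c", "d"],
    "lane_cnt": [2.0, np.nan, np.nan, np.nan],
    "rd_no": ["JA1", "JA2", None, "JA4"],
})
summary = missing_summary(toy)
print(summary)

# Missing fractions implied by the non-null counts printed above (60000 rows)
print(round(1 - 16560 / 60000, 3))  # lane_cnt
print(round(1 - 4501 / 60000, 3))   # crash_date_est_i
```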

### Data Cleaning

This step involves:

* Removal of duplicates, if any.
* Dealing with missing values
* Data preparation -- numerical & categorical DataFrames
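The first step in the list, removing duplicates, could look roughly like the sketch below. It assumes that `crash_record_id` uniquely identifies a crash, which is the same column the `info` function checks for duplicates. The toy values are made up.

```python
import pandas as pd

# Toy frame (made-up values); in the notebook this would be data_sample
toy = pd.DataFrame({
    "crash_record_id": ["a1", "b2", "a1", "c3"],
    "posted_speed_limit": [30, 25, 30, 35],
})

# Count repeated IDs, then keep only the first occurrence of each
n_duplicates = toy["crash_record_id"].duplicated().sum()
deduplicated = (
    toy.drop_duplicates(subset="crash_record_id", keep="first")
       .reset_index(drop=True)
)
print(n_duplicates, len(deduplicated))
```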

In [6]:
# Function to deal with missing values (modifies the DataFrame in place)
def missing_(data):
    # workers_present_i: replace null entries with "unknown"
    data["workers_present_i"] = data["workers_present_i"].fillna("unknown")
    # dooring_i: assume missing values mean the crash did not involve a car door
    data["dooring_i"] = data["dooring_i"].fillna("N")
    # work_zone_type: assume missing values mean no construction was taking place
    data["work_zone_type"] = data["work_zone_type"].fillna("None")
    # work_zone_i: replace missing values with "N" for the same reason
    data["work_zone_i"] = data["work_zone_i"].fillna("N")
    # Drop columns unrelated to the target that have missing values
    data.drop(columns=["photos_taken_i", "statements_taken_i", "crash_date_est_i", "not_right_of_way_i"], inplace=True)
    # lane_cnt: fill missing values with the median of the DataFrame passed in
    data["lane_cnt"] = data["lane_cnt"].fillna(data["lane_cnt"].median())
    # hit_and_run_i: fill null values with "unknown"
    data["hit_and_run_i"] = data["hit_and_run_i"].fillna("unknown")
    # intersection_related_i: fill null values with "unknown"
    data["intersection_related_i"] = data["intersection_related_i"].fillna("unknown")
    # TODO: rows where latitude or longitude == 0 are not yet dropped
    # Drop the remaining rows with null values, since they are few
    data.dropna(inplace=True)
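The function above mentions rows whose latitude or longitude equals 0 but does not filter them. One possible way to do so, sketched here on made-up values, is a boolean mask. The `drop_zero_coordinates` helper is hypothetical and not part of the original notebook, and applying it would change the row counts reported later.

```python
import pandas as pd

def drop_zero_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows whose latitude or longitude is exactly 0."""
    has_zero = (df["latitude"] == 0) | (df["longitude"] == 0)
    return df.loc[~has_zero].reset_index(drop=True)

# Toy coordinates: the middle row is a placeholder (0, 0) location
toy = pd.DataFrame({
    "latitude": [41.853074, 0.0, 41.706465],
    "longitude": [-87.618898, 0.0, -87.681770],
})
cleaned = drop_zero_coordinates(toy)
print(cleaned)
```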

In [7]:
# Pass the sample to the function to handle the missing values
missing_(data_sample)

In [8]:
# Confirm that the missing values have been handled
info(data_sample)

Data Info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57418 entries, 0 to 59999
Data columns (total 45 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash_record_id                57418 non-null  object 
 1   rd_no                          57418 non-null  object 
 2   crash_date                     57418 non-null  object 
 3   posted_speed_limit             57418 non-null  int64  
 4   traffic_control_device         57418 non-null  object 
 5   device_condition               57418 non-null  object 
 6   weather_condition              57418 non-null  object 
 7   lighting_condition             57418 non-null  object 
 8   first_crash_type               57418 non-null  object 
 9   trafficway_type                57418 non-null  object 
 10  lane_cnt                       57418 non-null  float64
 11  alignment                      57418 non-null  object 
 12  roadway_surface_cond           57418

## 2.2 Data Analysis

* Univariate Analysis

In [None]:
# Split the dataset into numerical and categorical DataFrames
data_cat = data_sample.select_dtypes("object")
data_num = data_sample.select_dtypes("number")
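As an illustration of what univariate analysis on the two DataFrames might involve, the sketch below summarises one categorical and one numerical column separately. The rows are made up, and the `RAIN` and `SNOW` labels are assumed for illustration. In the notebook, the same calls would be applied to columns of `data_cat` and `data_num`.

```python
import pandas as pd

# Toy sample (made-up rows) using two column names from the dataset
toy = pd.DataFrame({
    "weather_condition": ["CLEAR", "CLEAR", "RAIN", "CLEAR", "SNOW"],
    "posted_speed_limit": [30, 30, 25, 20, 30],
})

# Categorical column: share of each category
weather_share = toy["weather_condition"].value_counts(normalize=True)
# Numerical column: count, mean, spread and quartiles
speed_stats = toy["posted_speed_limit"].describe()

print(weather_share)
print(speed_stats)
```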

* Bivariate Analysis
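One way to approach a bivariate comparison between two categorical features is a row-normalised contingency table. The sketch below uses made-up rows with category labels taken from the sample output. The choice of `lighting_condition` against `first_crash_type` is only an example, not necessarily the pairing intended for this section.

```python
import pandas as pd

toy = pd.DataFrame({
    "lighting_condition": ["DAYLIGHT", "DAYLIGHT", "DARKNESS, LIGHTED ROAD",
                           "DARKNESS, LIGHTED ROAD", "DAYLIGHT"],
    "first_crash_type": ["REAR END", "ANGLE", "FIXED OBJECT",
                         "FIXED OBJECT", "REAR END"],
})

# Share of each crash type within each lighting condition (each row sums to 1)
table = pd.crosstab(toy["lighting_condition"], toy["first_crash_type"], normalize="index")
print(table)
```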

* Multivariate Analysis

# 3. Data Preprocessing

# 4. Modelling

# 5. Conclusions


# 6. Recommendations