# Chicago, IL : Car Crash Analysis & Predictive Modeling


### *Predicting car crashes with Machine Learning Models*

Authors: [Christos Maglaras](mailto:Christo111M@gmail.com), [Marcos Panyagua](mailto:marcosvppfernandes@gmail.com), [Jamie Dowat](mailto:jamie_dowat44@yahoo.com)

![chicago](img/chicago_night_drive.jpg)



### Stakeholder: Chicago Department of Transportation

![cdot](img/cdot.png)

### Business Problem
A report from the National Safety Council estimates that 42,060 people died in vehicle crashes in 2020, even though miles driven fell by 13%. That is an 8% increase over 2019.
Ken Kolosh, the safety council’s manager of statistics, told the Chicago Tribune, one of the outlets that covered the report, that “The pandemic appears to be taking our eyes off the ball when it comes to traffic safety”.
Michael Hanson, director of the Minnesota Public Safety Department’s Office of Traffic Safety, echoed this: “We’re seeing a huge increase in the amount of risk-taking behavior”.
We therefore took the three datasets available on the City of Chicago data portal and analyzed them in more depth, aiming to pinpoint some of the causes of crashes and to suggest measures that could improve safety in the city.


## Project Requirements:

Problem First: Start with a problem that you are interested in that you could potentially solve with a classification model. Then look for data that you could use to solve that problem. This approach is high-risk, high-reward: Very rewarding if you are able to solve a problem you are invested in, but frustrating if you end up sinking lots of time in without finding appropriate data. To mitigate the risk, set a firm limit for the amount of time you will allow yourself to look for data before moving on to the Data First approach.

Data First: Take a look at some of the most popular internet repositories of cool data sets we've listed below. If you find a data set that's particularly interesting for you, then it's totally okay to build your problem around that data set.

### Car Crash data:

Build a classifier to predict the primary contributory cause of a car accident, given information about the car, the people in the car, the road conditions, etc. You might imagine your audience as a Vehicle Safety Board that is interested in reducing traffic accidents, or as the City of Chicago, which is interested in becoming aware of any interesting patterns. Note that this is a multi-class classification problem. You will almost certainly want to bin, trim, or otherwise limit the number of target categories on which you ultimately predict. Note, for example, that some primary contributory causes have very few samples.
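One way to limit the number of target categories is to fold rarely seen causes into a single catch-all class. The sketch below is illustrative only: the helper `bin_rare_categories`, the `OTHER` label, the `min_count` threshold, and the sample cause strings are assumptions, not values taken from the project.

```python
import pandas as pd


def bin_rare_categories(series: pd.Series, min_count: int,
                        other_label: str = "OTHER") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; swap the rare ones for the catch-all label
    return series.where(~series.isin(rare), other_label)


# Tiny made-up sample; the cause strings are illustrative, not real counts
causes = pd.Series(
    ["FOLLOWING TOO CLOSELY"] * 5
    + ["FAILING TO YIELD RIGHT-OF-WAY"] * 4
    + ["ANIMAL"]
    + ["TEXTING"]
)
binned = bin_rare_categories(causes, min_count=3)
print(binned.value_counts())
```

The threshold is a modeling choice: a higher `min_count` gives fewer, better-populated classes at the cost of a less specific target.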

# EDA
Crash data shows information about each traffic crash on city streets within the **City of Chicago** limits and under the jurisdiction of Chicago Police Department (CPD). Data are shown as is from the electronic crash reporting system (E-Crash) at CPD, excluding any personally identifiable information. Records are added to the data portal when a crash report is finalized or when amendments are made to an existing report in E-Crash. Data from E-Crash are available for some police districts in 2015, but citywide data are not available until September 2017. About half of all crash reports, mostly minor crashes, are self-reported at the police district by the driver(s) involved and the other half are recorded at the scene by the police officer responding to the crash. Many of the crash parameters, including street condition data, weather condition, and posted speed limits, are recorded by the reporting officer based on best available information at the time, but many of these may disagree with posted information or other assessments on road conditions. If any new or updated information on a crash is received, the reporting officer may amend the crash report at a later time. A traffic crash within the city limits for which CPD is not the responding police agency, typically crashes on interstate highways, freeway ramps, and on local roads along the City boundary, are excluded from this dataset.
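Because citywide coverage only begins in September 2017, an analysis that compares areas of the city may want to exclude earlier records. The following is a minimal sketch of such a filter. It assumes the cutoff is the first day of September 2017 and that `CRASH_DATE` is stored as `MM/DD/YYYY HH:MM:SS AM/PM` text; the helper name `citywide_only` and the sample rows are hypothetical.

```python
import pandas as pd

# Assumption: treat the first day of September 2017 as the start of citywide coverage
CITYWIDE_START = pd.Timestamp("2017-09-01")


def citywide_only(crashes: pd.DataFrame, date_col: str = "CRASH_DATE") -> pd.DataFrame:
    """Keep only crashes dated on or after CITYWIDE_START."""
    crash_dt = pd.to_datetime(crashes[date_col], format="%m/%d/%Y %I:%M:%S %p")
    return crashes[crash_dt >= CITYWIDE_START]


# Made-up rows for illustration
sample = pd.DataFrame({
    "CRASH_DATE": ["03/15/2016 08:30:00 AM",
                   "09/01/2017 12:05:00 AM",
                   "11/20/2019 05:45:00 PM"],
    "NUM_UNITS": [2, 2, 3],
})
filtered = citywide_only(sample)
print(filtered)
```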

All crashes are recorded as per the format specified in the Traffic Crash Report, SR1050, of the Illinois Department of Transportation. **As per Illinois statute, only crashes with a property damage value of *$1,500 or more* or involving bodily injury to any person(s) and that happen on a public roadway and that involve at least one moving vehicle, except bike dooring, are considered reportable crashes.** However, CPD records every reported traffic crash event, regardless of the statute of limitations, and hence any formal Chicago crash dataset released by Illinois Department of Transportation may not include all the crashes listed here.
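The reportability rule above can be written as a small predicate. This is a sketch of one reading of the sentence, not an official definition: it assumes the damage value is in US dollars and that bike dooring is exempt from the moving-vehicle requirement. The function name and its parameters are hypothetical and do not correspond to dataset columns.

```python
def is_reportable(damage_value: float, any_injury: bool, on_public_roadway: bool,
                  moving_vehicle_involved: bool, is_bike_dooring: bool = False) -> bool:
    """Return True if a crash meets the reportability criteria as read above."""
    # Either the damage threshold or a bodily injury is enough
    serious_enough = damage_value >= 1500 or any_injury
    # Assumed reading: bike dooring counts even though the vehicle is not moving
    vehicle_condition = moving_vehicle_involved or is_bike_dooring
    return bool(serious_enough and on_public_roadway and vehicle_condition)


examples = [
    is_reportable(2000, False, True, True),   # damage at or above the threshold
    is_reportable(300, True, True, True),     # minor damage, but someone was injured
    is_reportable(300, False, True, True),    # neither threshold nor injury
    is_reportable(5000, False, False, True),  # not on a public roadway
]
print(examples)
```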

## Data: [Chicago City Data Portal](https://data.cityofchicago.org/)

![ccdp](img/chicagocitydataportal.jpg)

### [Crashes](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if):

##### Number of Rows: 482,866

*Shows crash data from the Chicago Police Department's **E-Crash** system*

**"All crashes are recorded as per the format specified in the Traffic Crash Report, SR1050, of the Illinois Department of Transportation."**

| Column Name                 | Description                |
| --------------------------- | -------------------------- |
| crash_record_id  |  Can be used to link to the same crash in the Vehicles and People datasets. |
| rd_no | Chicago Police Department report number|
| crash_date | Date and time of crash as entered by the reporting officer |
| posted_speed_limit  | Posted speed limit, as determined by reporting officer |
| traffic_control_device | Traffic control device present at crash location, as determined by reporting officer (signals, stop sign, etc) |
| device_condition  | Condition of traffic control device, as determined by reporting officer |
| weather_condition | Weather condition at time of crash, as determined by reporting officer |
| lighting_condition | Light condition at time of crash, as determined by reporting officer |
| first_crash_type | Type of first collision in crash |
| trafficway_type  | Trafficway type, as determined by reporting officer |
| lane_ct | Total number of through lanes in either direction, excluding turn lanes, as determined by reporting officer (0 = intersection)|
| alignment | Street alignment at crash location, as determined by reporting officer |
| roadway_surface_cond        | Road surface condition, as determined by reporting officer |
| road_defect | Road defects, as determined by reporting officer |
| crash_type | A general severity classification for the crash. Can be either Injury and/or Tow Due to Crash or No Injury / Drive Away |
| damage | A field observation of estimated damage. |
| prim_contributory_cause   | The factor which was most significant in causing the crash, as determined by officer judgment |
| sec_contributory_cause | The factor which was second most significant in causing the crash, as determined by officer judgment |
| street_name | Street address name of crash location, as determined by reporting officer|
| num_units | Number of units involved in the crash. A unit can be a motor vehicle, a pedestrian, a bicyclist, or another non-passenger roadway user. Each unit represents a mode of traffic with an independent trajectory. |
| most_severe_injury | Most severe injury sustained by any person involved in the crash |
| injuries_total | Total persons sustaining fatal, incapacitating, non-incapacitating, and possible injuries as determined by the reporting officer |
| injuries_fatal | Total persons sustaining fatal injuries in the crash |
| injuries_incapacitating | Total persons sustaining incapacitating/serious injuries in the crash as determined by the reporting officer. Any injury other than fatal injury, which prevents the injured person from walking, driving, or normally continuing the activities they were capable of performing before the injury occurred. Includes severe lacerations, broken limbs, skull or chest injuries, and abdominal injuries. |
| injuries_non_incapacitating | Total persons sustaining non-incapacitating injuries in the crash as determined by the reporting officer. Any injury, other than fatal or incapacitating injury, which is evident to observers at the scene of the crash. Includes lump on head, abrasions, bruises, and minor lacerations. |
| crash_hour | The hour of the day component of CRASH_DATE. |
| crash_day_of_week | The day of the week component of CRASH_DATE. Sunday=1 |
| latitude | The latitude of the crash location, as determined by reporting officer, as derived from the reported address of crash |
| longitude | The longitude of the crash location, as determined by reporting officer, as derived from the reported address of crash |
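Since `crash_day_of_week` is numeric with Sunday coded as 1, it can help to decode it before plotting or grouping. The sketch below assumes the remaining codes run in calendar order (2 = Monday through 7 = Saturday), which the table does not state explicitly; `DAY_NAMES` and the sample rows are made up for illustration.

```python
import pandas as pd

# Sunday=1 comes from the column description; codes 2-7 are assumed to follow in order
DAY_NAMES = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}

sample = pd.DataFrame({"crash_day_of_week": [1, 7, 2], "crash_hour": [0, 23, 8]})

# Human-readable day name and a simple weekend flag
sample["day_name"] = sample["crash_day_of_week"].map(DAY_NAMES)
sample["is_weekend"] = sample["crash_day_of_week"].isin([1, 7])
print(sample)
```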


### [People](https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d):

##### Number of Rows: 1,068,637

*Information about people involved in a crash and if any injuries were sustained.*

| Column Name                 | Description                |
| --------------------------- | -------------------------- |
| crash_record_id | This number can be used to link to the same crash in the Crashes and Vehicles datasets. This number also serves as a unique ID in the Crashes dataset. |
| person_type | Type of roadway user involved in crash |
| rd_no | Chicago Police Department report number. For privacy reasons, this column is blank for recent crashes. |
| crash_date | Date and time of crash as entered by the reporting officer |
| seat_no | Code for seating position of motor vehicle occupant: 1= driver, 2= center front, 3 = front passenger, 4 = second row left, 5 = second row center, 6 = second row right, 7 = enclosed passengers, 8 = exposed passengers, 9= unknown position, 10 = third row left, 11 = third row center, 12 = third row right |
| city | City of residence of person involved in crash |
| state | State of residence of person involved in crash |
| zipcode | ZIP Code of residence of person involved in crash |
| sex | Gender of person involved in crash, as determined by reporting officer |
| age | Age of person involved in crash |
| drivers_license_state | State issuing driver's license of person involved in crash |
| drivers_license_class | Class of driver's license of person involved in crash |
| safety_equipment | Safety equipment used by vehicle occupant in crash, if any |
| airbag_deployed | Whether vehicle occupant airbag deployed as result of crash |
| ejection | Whether vehicle occupant was ejected or extricated from the vehicle as a result of crash |
| injury_classification | Severity of injury person sustained in the crash |
| driver_action | Driver action that contributed to the crash, as determined by reporting officer |
| driver_vision | What, if any, objects obscured the driver’s vision at time of crash |
| physical_condition | Driver’s apparent physical condition at time of crash, as observed by the reporting officer |
| pedpedal_action | Action of pedestrian or cyclist at the time of crash |
| pedpedal_visibility | Visibility of pedestrian or cyclist safety equipment in use at time of crash |
| pedpedal_location | Location of pedestrian or cyclist at the time of crash |
| bac_result | Status of blood alcohol concentration testing for driver or other person involved in crash |
| bac_result value | Driver’s blood alcohol concentration test result (fatal crashes may include pedestrian or cyclist results) |
| cell_phone_use | Whether person was/was not using cellphone at the time of the crash, as determined by the reporting officer |

### [Vehicles](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3):

##### Number of Rows: 987,148

*Information about vehicles ("units") involved in a traffic crash.*

| Column Name                 | Description                |
| --------------------------- | -------------------------- |
| crash_record_id | This number can be used to link to the same crash in the Crashes and People datasets. This number also serves as a unique ID in the Crashes dataset. |
| rd_no | Chicago Police Department report number. For privacy reasons, this column is blank for recent crashes. |
| crash_date | Date and time of crash as entered by the reporting officer |
| unit_type | The type of unit (e.g., driver, parked, pedestrian, bicycle) |
| num_passengers | Number of passengers in the vehicle. The driver is not included. More information on passengers is in the People dataset. |
| make | The make (brand) of the vehicle, if relevant |
| model | The model of the vehicle, if relevant |
| lic_plate_state | The state issuing the license plate of the vehicle, if relevant |
| vehicle_year | The model year of the vehicle, if relevant |
| vehicle_defect | Indicates part of car containing defect (brakes, wheels, etc.) |
| vehicle_type | The type of vehicle, if relevant (passenger, truck, bus, etc) |
| vehicle_use | The normal use of the vehicle, if relevant |
| maneuver | The action the unit was taking prior to the crash, as determined by the reporting officer |
| towed_I | Indicator of whether the vehicle was towed |
| occupant_cnt | The number of people in the unit, as determined by the reporting officer |
| exceed_speed_limit_I | Indicator of whether the unit was speeding, as determined by the reporting officer |
| first_contact_point | Indicates orientation on car that was hit (front, rear, etc) |
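All three tables share `crash_record_id`, so they can be combined into one row per crash. Because a crash can have several vehicles and several people, joining the raw tables directly would multiply rows; the sketch below aggregates first and then joins. The toy frames and the derived column names (`n_vehicles`, `n_people`, `mean_age`) are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the Crashes, Vehicles and People tables
crashes = pd.DataFrame({"crash_record_id": ["a1", "b2"],
                        "posted_speed_limit": [30, 35]})
vehicles = pd.DataFrame({"crash_record_id": ["a1", "a1", "b2"],
                         "make": ["FORD", "TOYOTA", "HONDA"]})
people = pd.DataFrame({"crash_record_id": ["a1", "a1", "a1", "b2"],
                       "age": [34, 29, 7, 51]})

# Collapse the one-to-many tables to one row per crash before joining
vehicle_counts = (vehicles.groupby("crash_record_id").size()
                  .rename("n_vehicles").reset_index())
people_stats = (people.groupby("crash_record_id")["age"]
                .agg(n_people="count", mean_age="mean").reset_index())

# Left joins keep every crash, even if it has no vehicle or person rows
merged = (crashes
          .merge(vehicle_counts, on="crash_record_id", how="left")
          .merge(people_stats, on="crash_record_id", how="left"))
print(merged)
```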

## Repository Structure

├── data <br>
....├── traffic_crashes_chicago.csv<br>
....├── traffic_crashes_people.csv<br>
....└── traffic_crashes_vehicles.csv<br>
├── img<br>
....├── cdot.png<br>
....├── chicago_night_drive.jpg<br>
....├── chicagocitydataportal.jpg<br>
....├── finalmodeleltestcm.png<br>
....├── finalmodeltraincm.png<br>
....├── scoringsafety.png<br>
....└── visionzeroquotes.png<br>
├── notebooks<br>
....├── Christos_notebook.ipynb<br>
....├── jamie_notebook.ipynb<br>
....└── marcos_eda.ipynb<br>
├── src<br>
....├── __init__.py<br>
....├── data_cleaning.py<br>
....├── eda.py<br>
....└── models.py<br>
├── presentation.pdf<br>
├── final_notebook.ipynb<br>
└── README.md<br>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Source: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if
# The CSV sits in the repository's data/ folder (see Repository Structure);
# adjust the relative path if the notebook runs from a different directory.
data = pd.read_csv('data/traffic_crashes_chicago.csv')

In [None]:
data.head()

In [None]:
def basic_info(data):
    print("Dataset shape is: ", data.shape)
    print("Dataset size is: ", data.size)
    print("Dataset columns are: ", data.columns)
    print("Dataset info is:")
    data.info()  # info() prints its own summary and returns None
    categorical = []
    numerical = []
    for i in data.columns:
        if data[i].dtype == object:
            categorical.append(i)
        else:
            numerical.append(i)
    print("Categorical variables are:\n ", categorical)
    print("Numerical variables are:\n ", numerical)
    return categorical, numerical

In [None]:
len(data)

In [None]:
data.LATITUDE

In [None]:
data.CRASH_DATE_EST_I.value_counts()

In [None]:
data.isna().sum() 

In [None]:
data_fsm = data.fillna(value=0)

In [None]:
len(data_fsm)

In [None]:
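# Columns to leave out of the first simple model: the record ID, the target,
# and any column where more than 300,000 values are 0 after fillna (missing or zero)
lots_o_missing = ['CRASH_RECORD_ID', 'dead?']
for col in data_fsm.columns:
    if (data_fsm[col] == 0).sum() > 300000:
        lots_o_missing.append(col)
lots_o_missing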

In [None]:
data['HIT_AND_RUN_I'].value_counts()

In [None]:
data_fsm['dead?'] = np.where(data_fsm['INJURIES_FATAL']>0, 1, 0)
data_fsm.head()

In [None]:
from sklearn.linear_model import LogisticRegression
X = data_fsm.drop(labels=lots_o_missing, axis=1).select_dtypes(exclude='object')
y = data_fsm['dead?']
lr = LogisticRegression()
lr.fit(X, y)


In [None]:
lr.__dict__

In [None]:
lr.score(X, y)

In [None]:
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree 
dt = DecisionTreeClassifier()
dt.fit(X, y)
dt.score(X, y)

In [None]:
data_fsm.head()

In [None]:
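from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator replaces it
ConfusionMatrixDisplay.from_estimator(dt, X, y)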

In [None]:
dt.get_params()

In [None]:
!pip install imbalanced-learn

In [None]:

from imblearn.over_sampling import SMOTE
# Oversample the minority class; fit_resample replaced the older fit_sample method
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)


In [None]:
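# Lower-case every column name. Cells further down still index `data` with the
# original upper-case names (e.g. 'LATITUDE'), so they expect the un-renamed frame.
data.columns = data.columns.str.lower()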

In [None]:
data.iloc[111].crash_date

In [None]:
# Parse the 12-hour timestamp once instead of slicing the string, so that
# 12:xx AM maps to hour 0 and 12:xx PM stays at hour 12
crash_dt = pd.to_datetime(data['crash_date'], format='%m/%d/%Y %I:%M:%S %p')
data['month'] = crash_dt.dt.month
data['day'] = crash_dt.dt.day
data['year'] = crash_dt.dt.year
# Time of day as an HHMMSS integer on a 24-hour clock
data['time_of_crash'] = crash_dt.dt.hour * 10000 + crash_dt.dt.minute * 100 + crash_dt.dt.second
data['time_of_crash'].iloc[111]




In [None]:
data.crash_date.dtype

In [None]:
data.columns

# Visualizations

## Folium

In [None]:
!pip install folium
import folium
from folium import plugins

In [None]:
chimap = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.raster_layers.TileLayer('Open Street Map').add_to(chimap)
folium.raster_layers.TileLayer('Stamen Toner').add_to(chimap)
folium.raster_layers.TileLayer('Stamen Watercolor').add_to(chimap)
folium.raster_layers.TileLayer('CartoDB Positron').add_to(chimap)
folium.raster_layers.TileLayer('CartoDB Dark_Matter').add_to(chimap)
folium.raster_layers.TileLayer('Stamen Terrain').add_to(chimap)


# folium.LayerControl().add_to(chimap)

minimap = plugins.MiniMap(toggle_display=True)
chimap.add_child(minimap)

plugins.Fullscreen(position='topright').add_to(chimap)

draw = plugins.Draw(export=True)
draw.add_to(chimap)
display(chimap)

In [None]:
zipmap = folium.Map(location=[41.878876, -87.635918],
                     zoom_start = 10)
zipcodes = "https://data.cityofchicago.org/api/geospatial/gdcf-axmw?method=export&format=GeoJSON"
folium.GeoJson(zipcodes, name="Chicago Zipcodes").add_to(zipmap)
# `dataheat` is not created until the next cell, so count lighting conditions on `data` here
data['LIGHTING_CONDITION'].value_counts()

# Start of lighting_condition mapping

In [None]:
dataheat = data.dropna(subset = ['LATITUDE'])

dataheatdaylight = dataheat[dataheat['LIGHTING_CONDITION'] == 'DAYLIGHT']
dataheatdarknesslight = dataheat[dataheat['LIGHTING_CONDITION'] == 'DARKNESS, LIGHTED ROAD']
dataheatdarkness = dataheat[dataheat['LIGHTING_CONDITION'] == 'DARKNESS']
dataheatunknown = dataheat[dataheat['LIGHTING_CONDITION'] == 'UNKNOWN']
dataheatdusk = dataheat[dataheat['LIGHTING_CONDITION'] == 'DUSK']
dataheatdawn = dataheat[dataheat['LIGHTING_CONDITION'] == 'DAWN']

In [None]:
folium.plugins.HeatMap(list(zip(dataheat['LATITUDE'], dataheat['LONGITUDE'])), radius=2, blur=3).add_to(chimap)
folium.LayerControl().add_to(chimap)
display(chimap)

In [None]:
chimapdaylight = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(dataheatdaylight['LATITUDE'], dataheatdaylight['LONGITUDE'])), radius=2, blur=3).add_to(chimapdaylight)
folium.LayerControl().add_to(chimapdaylight)
plugins.Fullscreen(position='topright').add_to(chimapdaylight)

display(chimapdaylight)

In [None]:
chimapdarknesslight = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(dataheatdarknesslight['LATITUDE'], dataheatdarknesslight['LONGITUDE'])), radius=2, blur=3).add_to(chimapdarknesslight)
folium.LayerControl().add_to(chimapdarknesslight)
plugins.Fullscreen(position='topright').add_to(chimapdarknesslight)

display(chimapdarknesslight)

In [None]:
chimapdarkness = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(dataheatdarkness['LATITUDE'], dataheatdarkness['LONGITUDE'])), radius=2, blur=3).add_to(chimapdarkness)
folium.LayerControl().add_to(chimapdarkness)
plugins.Fullscreen(position='topright').add_to(chimapdarkness)

display(chimapdarkness)

In [None]:
chimapunknown = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(dataheatunknown['LATITUDE'], dataheatunknown['LONGITUDE'])), radius=2, blur=3).add_to(chimapunknown)
folium.LayerControl().add_to(chimapunknown)
plugins.Fullscreen(position='topright').add_to(chimapunknown)

display(chimapunknown)

In [None]:
chimapdusk = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(dataheatdusk['LATITUDE'], dataheatdusk['LONGITUDE'])), radius=2, blur=3).add_to(chimapdusk)
folium.LayerControl().add_to(chimapdusk)
plugins.Fullscreen(position='topright').add_to(chimapdusk)

display(chimapdusk)

In [None]:
chimapdawn = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(dataheatdawn['LATITUDE'], dataheatdawn['LONGITUDE'])), radius=2, blur=3).add_to(chimapdawn)
folium.LayerControl().add_to(chimapdawn)
plugins.Fullscreen(position='topright').add_to(chimapdawn)

display(chimapdawn)
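The six lighting-condition cells above repeat the same filter-and-plot steps. As a possible simplification, the coordinates for every category can be collected in one pass. The sketch below uses only pandas; the helper `heat_points_by_category` and the sample rows are hypothetical, and the default column names assume the upper-case headers used in the cells above.

```python
import pandas as pd


def heat_points_by_category(df, category_col, lat_col="LATITUDE", lon_col="LONGITUDE"):
    """Return {category: [[lat, lon], ...]} for rows that have both coordinates."""
    located = df.dropna(subset=[lat_col, lon_col])
    return {
        category: group[[lat_col, lon_col]].values.tolist()
        for category, group in located.groupby(category_col)
    }


# Made-up rows; the third one has no latitude and is skipped
sample = pd.DataFrame({
    "LATITUDE": [41.88, 41.90, None, 41.75],
    "LONGITUDE": [-87.63, -87.65, -87.70, -87.60],
    "LIGHTING_CONDITION": ["DAYLIGHT", "DARKNESS", "DAYLIGHT", "DAYLIGHT"],
})
points = heat_points_by_category(sample, "LIGHTING_CONDITION")
print(points)
```

Each list in `points` has the shape that `folium.plugins.HeatMap` is given in the cells above, so a loop over the dictionary could build one map per lighting condition.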

# End of lighting_condition mapping

In [None]:
pd.set_option('display.max_columns', None)
data.head()

In [None]:
inj_acc = data[data['INJURIES_TOTAL'] > 0]
fat_acc = inj_acc[inj_acc['INJURIES_FATAL'] > 0]
print(f'{(len(fat_acc)/len(inj_acc))*100} percent of injurious accidents result in deaths')

## Start of injuries_total mapping

In [None]:
datainj = data[data['INJURIES_TOTAL'] > 0]
datainjuries = datainj.dropna(subset = ['LATITUDE'])
datainjuries['LATITUDE'].isna().sum()


chiinjuries = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(datainjuries['LATITUDE'], datainjuries['LONGITUDE'])), radius=2, blur=3).add_to(chiinjuries)
folium.LayerControl().add_to(chiinjuries)
plugins.Fullscreen(position='topright').add_to(chiinjuries)

display(chiinjuries)

In [None]:
datainjy = data[data['INJURIES_TOTAL'] == 0]
datainjuriesy = datainjy.dropna(subset = ['LATITUDE'])
datainjuriesy['LATITUDE'].isna().sum()


chiinjuriesy = folium.Map(location=[41.878876, -87.635918],
                    zoom_start = 12,
                    control_scale=True,
                   tiles = "OpenStreetMap")

folium.plugins.HeatMap(list(zip(datainjuriesy['LATITUDE'], datainjuriesy['LONGITUDE'])), radius=2, blur=3).add_to(chiinjuriesy)
folium.LayerControl().add_to(chiinjuriesy)
plugins.Fullscreen(position='topright').add_to(chiinjuriesy)

display(chiinjuriesy)

## End of injuries_total mapping

## Start of weekday graphing

In [None]:
# Split crashes by whether anyone was injured
crash_no_inj = data[data['INJURIES_TOTAL'] == 0]
crash_inj    = data[data['INJURIES_TOTAL'] > 0]

# Sum NUM_UNITS per weekday; selecting the column first keeps sum() away from text columns
cni = crash_no_inj.groupby('CRASH_DAY_OF_WEEK')['NUM_UNITS'].sum().to_frame()
cyi = crash_inj.groupby('CRASH_DAY_OF_WEEK')['NUM_UNITS'].sum().to_frame()

f, ax = plt.subplots(figsize=(10, 10))

# Both series share one axis, so the injury bars are drawn over the no-injury bars
sns.set_color_codes("muted")
sns.barplot(x=cni.index, y="NUM_UNITS", data=cni,
            label="No Injury Accidents", color="m")
sns.barplot(x=cyi.index, y="NUM_UNITS", data=cyi,
            label="Injury Accidents", color="r")

ax.legend(ncol=2, loc="lower right", frameon=True)
# The bars show summed NUM_UNITS, i.e. units involved, not a count of crashes
ax.set(ylabel="Units involved in crashes", xlabel="Day of Week")
sns.despine(left=True, bottom=True)

## End of weekday graphing

## Start of month graphing

In [None]:
# Split crashes by whether anyone was injured
crash_no_inj = data[data['INJURIES_TOTAL'] == 0]
crash_inj    = data[data['INJURIES_TOTAL'] > 0]

# Sum NUM_UNITS per month; selecting the column first keeps sum() away from text columns
cni = crash_no_inj.groupby('CRASH_MONTH')['NUM_UNITS'].sum().to_frame()
cyi = crash_inj.groupby('CRASH_MONTH')['NUM_UNITS'].sum().to_frame()

sns.set_theme(style="whitegrid")

f, ax = plt.subplots(figsize=(10, 10))

# Both series share one axis, so the injury bars are drawn over the no-injury bars
sns.set_color_codes("muted")
sns.barplot(x=cni.index, y="NUM_UNITS", data=cni,
            label="No Injury Accidents", color="m")
sns.barplot(x=cyi.index, y="NUM_UNITS", data=cyi,
            label="Injury Accidents", color="r")

ax.legend(ncol=2, loc="lower right", frameon=True)
# The bars show summed NUM_UNITS, i.e. units involved, not a count of crashes
ax.set(ylabel="Units involved in crashes", xlabel="Month")
sns.despine(left=True, bottom=True)

## End of month graphing

## Start hourly graphing

In [None]:
# Split crashes by whether anyone was injured
crash_no_inj = data[data['INJURIES_TOTAL'] == 0]
crash_inj    = data[data['INJURIES_TOTAL'] > 0]

# Sum NUM_UNITS per hour; selecting the column first keeps sum() away from text columns
cni = crash_no_inj.groupby('CRASH_HOUR')['NUM_UNITS'].sum().to_frame()
cyi = crash_inj.groupby('CRASH_HOUR')['NUM_UNITS'].sum().to_frame()

sns.set_theme(style="whitegrid")

f, ax = plt.subplots(figsize=(10, 10))

# Both series share one axis, so the injury bars are drawn over the no-injury bars
sns.set_color_codes("muted")
sns.barplot(x=cni.index, y="NUM_UNITS", data=cni,
            label="No Injury Accidents", color="m")
sns.barplot(x=cyi.index, y="NUM_UNITS", data=cyi,
            label="Injury Accidents", color="r")

ax.legend(ncol=2, loc="lower right", frameon=True)
# The bars show summed NUM_UNITS, i.e. units involved, not a count of crashes
ax.set(ylabel="Units involved in crashes", xlabel="Hour")
sns.despine(left=True, bottom=True)

## End hourly graphing

# XGBOOST

In [None]:
import pandas as pd
# The CSV sits in the repository's data/ folder; adjust the path if needed
df = pd.read_csv('data/traffic_crashes_chicago.csv')

# Drop identifiers, dates, location fields, the per-severity injury counts,
# and the categorical columns that are not used as features
df = df.drop(['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE', 'TRAFFIC_CONTROL_DEVICE',
              'DEVICE_CONDITION', 'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE', 'ALIGNMENT', 'REPORT_TYPE',
              'CRASH_TYPE', 'INTERSECTION_RELATED_I', 'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE',
              'DATE_POLICE_NOTIFIED', 'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NAME',
              'BEAT_OF_OCCURRENCE', 'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
              'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'MOST_SEVERE_INJURY', 'STREET_NO', 'INJURIES_FATAL',
              'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING', 'INJURIES_REPORTED_NOT_EVIDENT',
              'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN', 'LATITUDE', 'LONGITUDE',  'LOCATION', 'LANE_CNT'], axis=1)
# Forward-fill missing street directions
df['STREET_DIRECTION'] = df['STREET_DIRECTION'].ffill()
# Binary target: did the crash injure anyone?
df['target-injuries'] = df['INJURIES_TOTAL'] > 0
df.drop('INJURIES_TOTAL', axis=1, inplace=True)

# Integer-encode the remaining categorical columns
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
for col in ['WEATHER_CONDITION', 'LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND',
            'ROAD_DEFECT', 'STREET_DIRECTION']:
    df[col] = lbl.fit_transform(df[col].astype(str))

X = df.drop('target-injuries', axis=1)
y = df['target-injuries']

In [None]:
df.columns

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
import xgboost as xgb

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
D_train = xgb.DMatrix(X_train, label=Y_train)
D_test  = xgb.DMatrix(X_test, label=Y_test)

In [None]:
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

In [None]:
# params = {
#         'min_child_weight': [5,6,7,8],
#         'gamma'           : [1.1,1.2,1.3],
#         'subsample'       : [.7,.8,.9],
#         'max_depth'       : [10,11,12,13],
#         'eta'             : [.2,.3,.4],
#         'colsample_bytree': [.4,.5,.6]        
#         }
# subsample': 0.8, 'max_depth': 10, 'gamma': 1, 'colsample_bytree': 0.6
# subsample': 0.8, 'max_depth': 11, 'gamma': 1.1, 'eta': 0.3, 'colsample_bytree': 0.7
# subsample': 0.8, 'min_child_weight': 6, 'max_depth': 11, 'gamma': 1.1, 'eta': 0.3, 'colsample_bytree': 0.5

In [None]:
# num = 1
# for each in params.values():
#     num = num * len(each)
#     print(f'There are {num*folds} combinations')
# print(f'{(folds*param_comb*6*3)/60} minutes')

In [None]:
# xgb = XGBClassifier(learning_rate=0.02,
#                     n_estimators=600,
#                     objective='binary:logistic',
#                     silent=True,
#                     nthread=1,
#                     tree_method= 'gpu_hist'
# #                     verbosity=0,
# #                    scale_pos_weight = 7
#                    )


In [None]:
# folds = 5
# param_comb = 50

In [None]:
# skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

# random_search = RandomizedSearchCV(xgb,
#                                    param_distributions=params,
#                                    n_iter=param_comb,
#                                    scoring='roc_auc',
#                                    n_jobs=4,
#                                    cv=skf.split(X_train,Y_train),
#                                    verbose=3,
#                                    random_state=1001 )

# start_time = timer(None)
# random_search.fit(X_train, Y_train)
# timer(start_time)

In [None]:
# print('\n All results:')
# print(random_search.cv_results_)
# print('\n Best estimator:')
# print(random_search.best_estimator_)
# print('\n Best normalized gini score for %d-fold search with %d parameter combinations:' % (folds, param_comb))
# print(random_search.best_score_)
# #       * 2 - 1)
# print('\n Best hyperparameters:')
# print(random_search.best_params_)
# results = pd.DataFrame(random_search.cv_results_)
# results.to_csv('xgb-random-grid-search-results-01.csv', index=False)

## Bayes parameter search

In [None]:
!pip install bayesian-optimization

In [None]:
Y_train = Y_train.astype('int')
Y_train

In [None]:
from xgboost import XGBClassifier
classifier1 = XGBClassifier().fit(X_train, Y_train)

train_p1 = classifier1.predict(X_train)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, hamming_loss
print(classification_report(Y_train, train_p1))

In [None]:
sum(train_p1)

In [None]:
# confusion_matrix expects (y_true, y_pred)
cm = confusion_matrix(Y_train, train_p1)
acc = cm.diagonal().sum()/cm.sum()
print(acc)

In [None]:
from bayes_opt import BayesianOptimization
import xgboost as xgb
from sklearn.metrics import mean_squared_error, confusion_matrix

In [None]:
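def bo_tune_xgb(max_depth, gamma, n_estimators, learning_rate, scale_pos_weight, min_child_weight, colsample_bytree):
    # Booster parameters for xgb.cv; the optimizer proposes floats, so integer-valued ones are cast
    params = {'max_depth'       : int(max_depth),
              'gamma'           : gamma,
              'learning_rate'   : learning_rate,
              'subsample'       : 0.8,
              'objective'       : 'binary:logistic',  # binary target, as in the XGBClassifier fits
              'eval_metric'     : 'rmse',
              'min_child_weight': min_child_weight,
              'scale_pos_weight': scale_pos_weight,
              'colsample_bytree': colsample_bytree,
              'tree_method'     : 'gpu_hist'}  # needs a GPU-enabled XGBoost build
    # n_estimators is not a booster parameter; in the native API it is the number of boosting rounds
    cv_result = xgb.cv(params, D_train, num_boost_round=int(n_estimators), nfold=5)
    # BayesianOptimization maximizes, so return the negated error
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]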

In [None]:
xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth' : (3, 15),
                        'gamma' : (0, 2),
                        'learning_rate'    : (0,1),
                        'n_estimators'     : (100,400),
                        'scale_pos_weight' : (5,10),
                        'min_child_weight' : (1,10),
                        'colsample_bytree' : (0,1)})
xgb_bo

In [None]:
xgb_bo.maximize(n_iter=10, init_points=12, acq='ei')

In [None]:
params = xgb_bo.max['params']
print(params)


In [None]:
params['max_depth']= int(params['max_depth'])
params['n_estimators']= int(params['n_estimators'])

In [None]:
from xgboost import XGBClassifier


In [None]:
classifier2 = XGBClassifier(**params).fit(X_train, Y_train)



In [None]:
train_p2 = classifier2.predict(X_train)



In [None]:
# classification_report expects (y_true, y_pred); swapping them swaps precision and recall
print(classification_report(Y_train, train_p2))



In [None]:
# confusion_matrix expects (y_true, y_pred)
cm = confusion_matrix(Y_train, train_p2)
acc = cm.diagonal().sum()/cm.sum()
print(acc)

In [None]:
sum(train_p2)

In [None]:
import seaborn as sns
# Rows are true classes, columns are predicted classes, shown as shares of all training rows
cf = confusion_matrix(Y_train, train_p2)
sns.heatmap(cf/np.sum(cf), annot=True, fmt='.2%')
