# Overview of Chicago Crash Data

In [1]:
import pandas as pd

from pandas_profiling import ProfileReport
from Datafun import column_cutter


This analysis draws on three datasets about crashes in the city of Chicago:
* Traffic_Crashes_-_Crashes.csv - Specific information about each crash recorded (https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if)
* Traffic_Crashes_-_People.csv - Information about the driver and passengers inside the vehicle (https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d)
* Traffic_Crashes_-_Vehicles.csv - Information about the specific vehicles involved in the crash (https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3)

In [2]:
Crashes = pd.read_csv('../raw_data/Traffic_Crashes_-_Crashes.csv')
Crashes.head()

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,073682ef84ff827659552d4254ad1b98bfec24935cc9cc...,JB460108,,10/02/2018 06:30:00 PM,10,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,PARKED MOTOR VEHICLE,...,0.0,0.0,1.0,0.0,18,3,10,,,
1,1560fb8a1e32b528fef8bfd677d2b3fc5ab37278b157fa...,JC325941,,06/27/2019 04:00:00 PM,45,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,...,0.0,0.0,2.0,0.0,16,5,6,,,
2,009e9e67203442370272e1a13d6ee51a4155dac65e583d...,JA329216,,06/30/2017 04:00:00 PM,35,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,...,0.0,0.0,3.0,0.0,16,6,6,41.741804,-87.740954,POINT (-87.740953581987 41.741803598989)
3,00e47f189660cd8ba1e85fc63061bf1d8465184393f134...,JC194776,,03/21/2019 10:50:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",TURNING,...,0.0,0.0,2.0,0.0,22,5,3,41.741804,-87.740954,POINT (-87.740953581987 41.741803598989)
4,0126747fc9ffc0edc9a38abb83d80034f897db0f739eef...,JB200478,,03/26/2018 02:23:00 PM,35,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,...,0.0,0.0,2.0,0.0,14,2,3,41.953647,-87.732082,POINT (-87.732081736006 41.953646899951)


#### A profile report was created for this data using:

crash = ProfileReport(Crashes)

crash.to_file(output_file='crash_data.html')

The HTML file (crash_data.html) can be found in the same folder as this file for review of the original data.

In [3]:
Crashes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416198 entries, 0 to 416197
Data columns (total 49 columns):
CRASH_RECORD_ID                  416198 non-null object
RD_NO                            412429 non-null object
CRASH_DATE_EST_I                 30836 non-null object
CRASH_DATE                       416198 non-null object
POSTED_SPEED_LIMIT               416198 non-null int64
TRAFFIC_CONTROL_DEVICE           416198 non-null object
DEVICE_CONDITION                 416198 non-null object
WEATHER_CONDITION                416198 non-null object
LIGHTING_CONDITION               416198 non-null object
FIRST_CRASH_TYPE                 416198 non-null object
TRAFFICWAY_TYPE                  416198 non-null object
LANE_CNT                         198555 non-null float64
ALIGNMENT                        416198 non-null object
ROADWAY_SURFACE_COND             416198 non-null object
ROAD_DEFECT                      416198 non-null object
REPORT_TYPE                      406344 non-null o

Several columns already stand out as candidates for removal because most of their values are missing.
Columns to remove based on missing values:
* crash_date_est_i
* lane_cnt  - further investigation needed
* intersection_related_i  - further investigation needed
* not_right_of_way_i
* hit_and_run_i  - further investigation needed
* photos_taken_i
* statements_taken_i
* dooring_i
* work_zone_i
* work_zone_type
* workers_present_i
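
The list above was read off the `Crashes.info()` output, where, for example, CRASH_DATE_EST_I is populated in only about 7% of the 416198 rows and LANE_CNT in a little under half. A minimal sketch of how the same screening could be done programmatically is shown below. The helper name `sparse_columns`, the 0.5 default threshold, and the tiny demo frame are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return the missing-value fraction of every column above `threshold`.

    `threshold` is an illustrative cutoff, not a value taken from the analysis.
    """
    missing = df.isna().mean()  # fraction of missing values per column
    return missing[missing > threshold].sort_values(ascending=False)


# Tiny stand-in frame; with the real data the call would be sparse_columns(Crashes).
demo = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "b", "c", "d"],
    "LANE_CNT": [2.0, None, None, 4.0],
    "DOORING_I": [None, None, None, "Y"],
})

sparse = sparse_columns(demo, threshold=0.4)
print(sparse)
```

Columns flagged this way would still need the manual review noted in the list (for example lane_cnt), since a mostly empty indicator column may simply mean "no" rather than "unknown".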

### Let's get a rough idea of how many different contributory causes the data identifies

In [5]:
Crashes['PRIM_CONTRIBUTORY_CAUSE'].nunique()

40

In [6]:
Crashes['PRIM_CONTRIBUTORY_CAUSE'].value_counts(normalize=True)

UNABLE TO DETERMINE                                                                 0.363260
FAILING TO YIELD RIGHT-OF-WAY                                                       0.112031
FOLLOWING TOO CLOSELY                                                               0.109849
NOT APPLICABLE                                                                      0.053941
IMPROPER OVERTAKING/PASSING                                                         0.048227
IMPROPER BACKING                                                                    0.044988
FAILING TO REDUCE SPEED TO AVOID CRASH                                              0.042309
IMPROPER LANE USAGE                                                                 0.040082
IMPROPER TURNING/NO SIGNAL                                                          0.033717
DRIVING SKILLS/KNOWLEDGE/EXPERIENCE                                                 0.031093
DISREGARDING TRAFFIC SIGNALS                                          

In [7]:
Crashes['SEC_CONTRIBUTORY_CAUSE'].nunique()

40

In [8]:
Crashes['SEC_CONTRIBUTORY_CAUSE'].value_counts(normalize=True)

NOT APPLICABLE                                                                      0.401468
UNABLE TO DETERMINE                                                                 0.355497
FAILING TO REDUCE SPEED TO AVOID CRASH                                              0.042134
DRIVING SKILLS/KNOWLEDGE/EXPERIENCE                                                 0.032119
FAILING TO YIELD RIGHT-OF-WAY                                                       0.030834
FOLLOWING TOO CLOSELY                                                               0.028462
IMPROPER OVERTAKING/PASSING                                                         0.015000
IMPROPER LANE USAGE                                                                 0.014834
WEATHER                                                                             0.012826
IMPROPER TURNING/NO SIGNAL                                                          0.010135
IMPROPER BACKING                                                      

In my opinion, 40 different classes is far too many, so they will need to be consolidated. One possible approach is to separate out the UNABLE TO DETERMINE causes and attempt to classify those records after building a model on the remaining data.
Grouping considerations:
* Driver Error
    * failing to reduce speed to avoid crash
    * failing to yield right-of-way
    * following too closely
    * improper overtaking/passing
    * improper lane usage
    * improper turning/no signal
    * improper backing
    * operating vehicle in erratic, reckless, careless, negligent, or aggressive manner
    * disregarding traffic signals
    * exceeding authorized speed limit
    * exceeding safe speed for conditions
    * disregarding stop sign
    * driving on wrong side/wrong way
    * disregarding road markings
    * disregarding other traffic signs
    * disregarding yield sign
    * passing stopped school bus
* Road/Vision Obstruction
    * evasive action due to animal, object, nonmotorist
    * vision obscured (signs, tree limbs, buildings, etc.)
    * animal
    * obstructed crosswalks
* Driver Distraction
    * distraction - from inside vehicle
    * distraction - from outside vehicle
    * cell phone use other than texting
    * texting
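
One way the grouping above might be applied is with a lookup dictionary and `Series.map`, holding out the UNABLE TO DETERMINE rows for later classification. This is only a sketch: the `CAUSE_GROUPS` mapping is deliberately partial, and the group labels, the "OTHER" fallback, and the helper name `group_causes` are assumptions rather than decisions made in this notebook.

```python
import pandas as pd

# Hypothetical, partial mapping: only a few causes from each group are listed.
CAUSE_GROUPS = {
    "FAILING TO YIELD RIGHT-OF-WAY": "DRIVER ERROR",
    "FOLLOWING TOO CLOSELY": "DRIVER ERROR",
    "IMPROPER BACKING": "DRIVER ERROR",
    "ANIMAL": "ROAD/VISION OBSTRUCTION",
    "TEXTING": "DRIVER DISTRACTION",
}


def group_causes(causes: pd.Series) -> pd.Series:
    """Map raw causes to broad groups; causes not in the mapping become 'OTHER'."""
    return causes.map(CAUSE_GROUPS).fillna("OTHER")


# Small stand-in for Crashes['PRIM_CONTRIBUTORY_CAUSE'].
demo = pd.Series(
    ["FOLLOWING TOO CLOSELY", "TEXTING", "UNABLE TO DETERMINE", "ANIMAL", "WEATHER"],
    name="PRIM_CONTRIBUTORY_CAUSE",
)
grouped = group_causes(demo)

# Hold out undetermined rows so they can be classified after a model is built.
undetermined = demo == "UNABLE TO DETERMINE"
labeled = grouped[~undetermined]
print(labeled.value_counts())
```

Keeping the mapping in a plain dictionary makes it easy to review and extend as the remaining causes are assigned to groups.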
        

In [6]:
Vehicles = pd.read_csv('../raw_data/Traffic_Crashes_-_Vehicles.csv')
Vehicles.head()


Unnamed: 0,CRASH_UNIT_ID,CRASH_RECORD_ID,RD_NO,CRASH_DATE,UNIT_NO,UNIT_TYPE,NUM_PASSENGERS,VEHICLE_ID,CMRC_VEH_I,MAKE,...,TRAILER1_LENGTH,TRAILER2_LENGTH,TOTAL_VEHICLE_LENGTH,AXLE_CNT,VEHICLE_CONFIG,CARGO_BODY_TYPE,LOAD_TYPE,HAZMAT_OUT_OF_SERVICE_I,MCS_OUT_OF_SERVICE_I,HAZMAT_CLASS
0,10,2e31858c0e411f0bdcb337fb7c415aa93763cf2f23e02f...,HY368708,08/04/2015 12:40:00 PM,1,DRIVER,,10.0,,FORD,...,,,,,,,,,,
1,100,e73b35bd7651b0c6693162bee0666db159b28901437009...,HY374018,07/31/2015 05:50:00 PM,1,DRIVER,,96.0,,NISSAN,...,,,,,,,,,,
2,1000,f2b1adeb85a15112e4fb7db74bff440d6ca53ff7a21e10...,HY407431,09/02/2015 11:45:00 AM,1,DRIVER,,954.0,,FORD,...,,,,,,,,,,
3,10000,15a3e24fce3ce7cd2b02d44013d1a93ff2fbdca80632ec...,HY484148,10/31/2015 09:30:00 PM,2,DRIVER,,9561.0,,HYUNDAI,...,,,,,,,,,,
4,100000,1d3c178880366c77deaf06b8c3198429112a1c8e8807ed...,HZ518934,11/16/2016 01:00:00 PM,2,PARKED,,96745.0,,"TOYOTA MOTOR COMPANY, LTD.",...,,,,,,,,,,


#### A profile report was created for this data using:

vehicle = ProfileReport(Vehicles)

vehicle.to_file(output_file='vehicle_data.html')

The HTML file (vehicle_data.html) can be found in the same folder as this file for review of the original data.

In [10]:
Vehicles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 837766 entries, 0 to 837765
Data columns (total 72 columns):
CRASH_UNIT_ID               837766 non-null int64
CRASH_RECORD_ID             837766 non-null object
RD_NO                       830047 non-null object
CRASH_DATE                  837766 non-null object
UNIT_NO                     837766 non-null int64
UNIT_TYPE                   837755 non-null object
NUM_PASSENGERS              119004 non-null float64
VEHICLE_ID                  820312 non-null float64
CMRC_VEH_I                  15506 non-null object
MAKE                        820281 non-null object
MODEL                       818348 non-null object
LIC_PLATE_STATE             753051 non-null object
VEHICLE_YEAR                689349 non-null float64
VEHICLE_DEFECT              820312 non-null object
VEHICLE_TYPE                820312 non-null object
VEHICLE_USE                 820312 non-null object
TRAVEL_DIRECTION            820312 non-null object
MANEUVER              

In [8]:
People = pd.read_csv('../raw_data/Traffic_Crashes_-_People.csv')
People.head()


Unnamed: 0,PERSON_ID,PERSON_TYPE,CRASH_RECORD_ID,RD_NO,VEHICLE_ID,CRASH_DATE,SEAT_NO,CITY,STATE,ZIPCODE,...,EMS_RUN_NO,DRIVER_ACTION,DRIVER_VISION,PHYSICAL_CONDITION,PEDPEDAL_ACTION,PEDPEDAL_VISIBILITY,PEDPEDAL_LOCATION,BAC_RESULT,BAC_RESULT VALUE,CELL_PHONE_USE
0,O10,DRIVER,2e31858c0e411f0bdcb337fb7c415aa93763cf2f23e02f...,HY368708,10.0,08/04/2015 12:40:00 PM,,CHICAGO,IL,60641.0,...,,FAILED TO YIELD,UNKNOWN,NORMAL,,,,TEST NOT OFFERED,,
1,O100,DRIVER,e73b35bd7651b0c6693162bee0666db159b28901437009...,HY374018,96.0,07/31/2015 05:50:00 PM,,ELK GROVE,IL,60007.0,...,,FOLLOWED TOO CLOSELY,UNKNOWN,NORMAL,,,,TEST NOT OFFERED,,
2,O1000,DRIVER,f2b1adeb85a15112e4fb7db74bff440d6ca53ff7a21e10...,HY407431,954.0,09/02/2015 11:45:00 AM,,CHICAGO,IL,,...,,UNKNOWN,UNKNOWN,NORMAL,,,,TEST NOT OFFERED,,
3,O10000,DRIVER,15a3e24fce3ce7cd2b02d44013d1a93ff2fbdca80632ec...,HY484148,9561.0,10/31/2015 09:30:00 PM,,SKOKIE,IL,60076.0,...,,NONE,NOT OBSCURED,NORMAL,,,,TEST NOT OFFERED,,
4,O100001,DRIVER,2fcefeab458932d8b1b12e103c18c50adc659943cccd4b...,HZ525619,96762.0,11/15/2016 05:45:00 PM,,,,,...,,UNKNOWN,UNKNOWN,UNKNOWN,,,,TEST NOT OFFERED,,


#### A profile report was created for this data using:

people = ProfileReport(People, title='People Data Profile Report')

people.to_file(output_file='people_data.html')

The HTML file (people_data.html) can be found in the same folder as this file for review of the original data.

In [13]:
People.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 897406 entries, 0 to 897405
Data columns (total 30 columns):
PERSON_ID                897406 non-null object
PERSON_TYPE              897406 non-null object
CRASH_RECORD_ID          897406 non-null object
RD_NO                    890075 non-null object
VEHICLE_ID               880130 non-null float64
CRASH_DATE               897406 non-null object
SEAT_NO                  175790 non-null float64
CITY                     668965 non-null object
STATE                    674867 non-null object
ZIPCODE                  612691 non-null object
SEX                      884968 non-null object
AGE                      645593 non-null float64
DRIVERS_LICENSE_STATE    541769 non-null object
DRIVERS_LICENSE_CLASS    474861 non-null object
SAFETY_EQUIPMENT         894886 non-null object
AIRBAG_DEPLOYED          880521 non-null object
EJECTION                 886384 non-null object
INJURY_CLASSIFICATION    897058 non-null object
HOSPITAL              

In [17]:
People['DRIVER_ACTION'].value_counts(normalize=True)

NONE                                 0.370379
UNKNOWN                              0.226758
FAILED TO YIELD                      0.095709
OTHER                                0.083829
FOLLOWED TOO CLOSELY                 0.068120
IMPROPER BACKING                     0.032928
IMPROPER LANE CHANGE                 0.027877
IMPROPER TURN                        0.026984
IMPROPER PASSING                     0.022546
TOO FAST FOR CONDITIONS              0.016481
DISREGARDED CONTROL DEVICES          0.015196
IMPROPER PARKING                     0.004057
WRONG WAY/SIDE                       0.003484
CELL PHONE USE OTHER THAN TEXTING    0.001757
EVADING POLICE VEHICLE               0.001692
EMERGENCY VEHICLE ON CALL            0.000935
OVERCORRECTED                        0.000646
TEXTING                              0.000451
STOPPED SCHOOL BUS                   0.000126
LICENSE RESTRICTIONS                 0.000042
Name: DRIVER_ACTION, dtype: float64

In [30]:
print(Crashes['CRASH_RECORD_ID'].nunique())
print(Vehicles['CRASH_RECORD_ID'].nunique())
People['CRASH_RECORD_ID'].nunique()

416198
414426


410559
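
The counts above show fewer distinct CRASH_RECORD_ID values in Vehicles and People than in Crashes, which suggests some crashes have no matching vehicle or person rows. A minimal sketch for listing those IDs follows. The helper name `ids_missing_from` and the demo frames are illustrative assumptions, and the sketch does not check the reverse direction (IDs present in Vehicles or People but absent from Crashes).

```python
import pandas as pd


def ids_missing_from(left: pd.DataFrame, right: pd.DataFrame,
                     key: str = "CRASH_RECORD_ID") -> pd.Series:
    """Return the distinct `key` values in `left` that never appear in `right`."""
    left_ids = left[key].drop_duplicates()
    return left_ids[~left_ids.isin(right[key])]


# Stand-in frames; the real call would be ids_missing_from(Crashes, Vehicles).
crashes_demo = pd.DataFrame({"CRASH_RECORD_ID": ["a", "b", "c"]})
vehicles_demo = pd.DataFrame({"CRASH_RECORD_ID": ["a", "a", "c"]})

missing = ids_missing_from(crashes_demo, vehicles_demo)
print(missing.tolist())
```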

In [29]:
People['PERSON_TYPE'].value_counts()

DRIVER                 704250
PASSENGER              175790
PEDESTRIAN              10482
BICYCLE                  5974
NON-MOTOR VEHICLE         750
NON-CONTACT VEHICLE       160
Name: PERSON_TYPE, dtype: int64