## Author: Jack Robbins
### Motor Vehicle Collisions in NYC, an exploration and regression analysis

In [41]:
# Important imports
import pandas as pd
import numpy as np

In [42]:
# Read in our dataframe
collisions = pd.read_csv("data/Motor_Vehicle_Collisions_-_Crashes_20241018.csv")
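ZIP CODE shows up as values like 11208.0 in this notebook's output, which suggests pandas inferred a numeric type for a column that is really a label. A minimal sketch of pinning the type at read time, assuming we would rather treat ZIP codes as text; the in-memory CSV and the `sample` name are made up for illustration and are not part of the original analysis:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows)
csv_text = (
    "CRASH DATE,BOROUGH,ZIP CODE\n"
    "09/11/2021,,\n"
    "09/11/2021,BROOKLYN,11208\n"
)

# dtype pins ZIP CODE to pandas' string type; low_memory=False makes pandas
# infer the remaining column types from the whole file rather than in chunks
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ZIP CODE": "string"},
    low_memory=False,
)
print(sample.dtypes)
```

With the real file, the same `dtype` and `low_memory` arguments could be passed to the `pd.read_csv` call above.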



In [43]:
# Let's get an idea of our data
collisions.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,09/11/2021,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,...,Unspecified,,,,4455765,Sedan,Sedan,,,
1,03/26/2022,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,...,,,,,4513547,Sedan,,,,
2,06/29/2022,6:55,,,,,,THROGS NECK BRIDGE,,,...,Unspecified,,,,4541903,Sedan,Pick-up Truck,,,
3,09/11/2021,9:35,BROOKLYN,11208.0,40.667202,-73.8665,"(40.667202, -73.8665)",,,1211 LORING AVENUE,...,,,,,4456314,Sedan,,,,
4,12/14/2021,8:13,BROOKLYN,11233.0,40.683304,-73.917274,"(40.683304, -73.917274)",SARATOGA AVENUE,DECATUR STREET,,...,,,,,4486609,,,,,


In [44]:
# We can see that we have quite a few rows
collisions.shape

(2127188, 29)

## Data Preprocessing 

In [45]:
null_values = collisions.isnull().sum()
print("Detecting missing values:\n", null_values)

Detecting missing values:
 CRASH DATE                             0
CRASH TIME                             0
BOROUGH                           661545
ZIP CODE                          661805
LATITUDE                          238998
LONGITUDE                         238998
LOCATION                          238998
ON STREET NAME                    455507
CROSS STREET NAME                 810739
OFF STREET NAME                  1764054
NUMBER OF PERSONS INJURED             18
NUMBER OF PERSONS KILLED              31
NUMBER OF PEDESTRIANS INJURED          0
NUMBER OF PEDESTRIANS KILLED           0
NUMBER OF CYCLIST INJURED              0
NUMBER OF CYCLIST KILLED               0
NUMBER OF MOTORIST INJURED             0
NUMBER OF MOTORIST KILLED              0
CONTRIBUTING FACTOR VEHICLE 1       7150
CONTRIBUTING FACTOR VEHICLE 2     333415
CONTRIBUTING FACTOR VEHICLE 3    1974243
CONTRIBUTING FACTOR VEHICLE 4    2092457
CONTRIBUTING FACTOR VEHICLE 5    2117737
COLLISION_ID                  
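Raw counts are easier to judge as a share of all rows. A minimal sketch on a toy frame (the values are made up for illustration; on the real data the same expression would be applied to `collisions`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions frame (hypothetical values)
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", None, "BRONX", None],
    "ZIP CODE": [11208.0, np.nan, 10475.0, np.nan],
    "CRASH DATE": ["09/11/2021", "03/26/2022", "12/14/2021", "06/29/2022"],
})

# The mean of a boolean mask is the fraction of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```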

## Let's analyze these findings...
So we can see that there are a lot of missing values, specifically in the vehicle type code columns above 2. This is probably because there aren't that many 3, 4 or 5 car collisions in NYC. So instead of dropping rows where these are null, we may as well simply drop these columns. This also applies to the "CONTRIBUTING FACTOR VEHICLE.." columns for 3, 4 and 5. Since I only care about one/two car collisions, the rows where these columns aren't null would ideally be dropped as well; note that the cell below only drops the columns.

In [46]:
# Drop the vehicle 3/4/5 columns (no backslashes needed inside the brackets)
collisions.drop(['VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5',
                 'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
                 'CONTRIBUTING FACTOR VEHICLE 5'], axis=1, inplace=True)
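Dropping the columns alone keeps the rows for 3+ vehicle collisions in the data. If we wanted to restrict the analysis to one/two car collisions, one possible approach is to filter the rows first and drop the columns afterwards. This is only a sketch on a toy frame with made-up values; `extra_cols` and `one_or_two` are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in: the last row is a three-vehicle collision
toy = pd.DataFrame({
    "VEHICLE TYPE CODE 1": ["Sedan", "Sedan", "Taxi"],
    "VEHICLE TYPE CODE 2": ["Sedan", None, "Sedan"],
    "VEHICLE TYPE CODE 3": [None, None, "Bus"],
    "CONTRIBUTING FACTOR VEHICLE 3": [None, None, "Unspecified"],
})

extra_cols = ["VEHICLE TYPE CODE 3", "CONTRIBUTING FACTOR VEHICLE 3"]

# Keep only rows where every third-vehicle field is empty
one_or_two = toy[toy[extra_cols].isnull().all(axis=1)]

# The extra columns are now entirely null, so they can be dropped
one_or_two = one_or_two.drop(columns=extra_cols)
print(one_or_two)
```

On the real data, `extra_cols` would list all six vehicle 3/4/5 columns, and the filter would have to run before those columns are dropped.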

In [47]:
collisions.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
0,09/11/2021,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,...,0,0,0,2,0,Aggressive Driving/Road Rage,Unspecified,4455765,Sedan,Sedan
1,03/26/2022,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,...,0,0,0,1,0,Pavement Slippery,,4513547,Sedan,
2,06/29/2022,6:55,,,,,,THROGS NECK BRIDGE,,,...,0,0,0,0,0,Following Too Closely,Unspecified,4541903,Sedan,Pick-up Truck
3,09/11/2021,9:35,BROOKLYN,11208.0,40.667202,-73.8665,"(40.667202, -73.8665)",,,1211 LORING AVENUE,...,0,0,0,0,0,Unspecified,,4456314,Sedan,
4,12/14/2021,8:13,BROOKLYN,11233.0,40.683304,-73.917274,"(40.683304, -73.917274)",SARATOGA AVENUE,DECATUR STREET,,...,0,0,0,0,0,,,4486609,,


### Dropping NA's
Let's remove any rows where the borough, ZIP code, or position of the crash is missing.

In [48]:
collisions.dropna(subset=['BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 'LOCATION'], how='any', inplace=True)
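With `subset` and `how='any'`, a row is removed as soon as one of the listed columns is null, while nulls in other columns are left alone. A small sketch of that behavior on a toy frame (made-up values; `location_cols` and `cleaned` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy stand-in: row 0 is complete on location but has no injury count
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", None, "BRONX"],
    "LATITUDE": [40.667202, np.nan, np.nan],
    "NUMBER OF PERSONS INJURED": [np.nan, 2.0, 1.0],
})
location_cols = ["BOROUGH", "LATITUDE"]

# Drop a row if ANY of the location columns is null
cleaned = toy.dropna(subset=location_cols, how="any")

# Location columns are now complete; other columns may still contain nulls
print(cleaned[location_cols].notna().all())
print(len(toy) - len(cleaned), "rows dropped")
```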

As we can see below, we should now have borough and position data for every single crash.

In [49]:
collisions

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
3,09/11/2021,9:35,BROOKLYN,11208.0,40.667202,-73.866500,"(40.667202, -73.8665)",,,1211 LORING AVENUE,...,0,0,0,0,0,Unspecified,,4456314,Sedan,
4,12/14/2021,8:13,BROOKLYN,11233.0,40.683304,-73.917274,"(40.683304, -73.917274)",SARATOGA AVENUE,DECATUR STREET,,...,0,0,0,0,0,,,4486609,,
7,12/14/2021,8:17,BRONX,10475.0,40.868160,-73.831480,"(40.86816, -73.83148)",,,344 BAYCHESTER AVENUE,...,0,0,0,2,0,Unspecified,Unspecified,4486660,Sedan,Sedan
8,12/14/2021,21:10,BROOKLYN,11207.0,40.671720,-73.897100,"(40.67172, -73.8971)",,,2047 PITKIN AVENUE,...,0,0,0,0,0,Driver Inexperience,Unspecified,4487074,Sedan,
9,12/14/2021,14:58,MANHATTAN,10017.0,40.751440,-73.973970,"(40.75144, -73.97397)",3 AVENUE,EAST 43 STREET,,...,0,0,0,0,0,Passing Too Closely,Unspecified,4486519,Sedan,Station Wagon/Sport Utility Vehicle
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2127136,07/10/2024,18:34,BRONX,10454.0,40.812263,-73.920590,"(40.812263, -73.92059)",WILLIS AVENUE,EAST 143 STREET,,...,0,0,0,0,0,Unspecified,,4746499,Taxi,
2127145,07/19/2024,18:00,BROOKLYN,11207.0,40.675735,-73.896860,"(40.675735, -73.89686)",ATLANTIC AVENUE,PENNSYLVANIA AVENUE,,...,0,0,0,0,0,Turning Improperly,Unspecified,4746359,Sedan,Sedan
2127162,07/07/2024,14:12,BRONX,10468.0,40.861084,-73.911490,"(40.861084, -73.91149)",,,2258 HAMPDEN PLACE,...,0,0,0,0,0,Unspecified,,4746320,Sedan,
2127172,07/21/2024,18:05,BROOKLYN,11224.0,40.572968,-74.000595,"(40.572968, -74.000595)",,,3514 SURF AVENUE,...,0,0,0,0,0,Backing Unsafely,Unspecified,4746425,Station Wagon/Sport Utility Vehicle,Pick-up Truck


In [50]:
# Let's see how we're doing now...
null_values = collisions.isnull().sum()
print("Our null values now:\n", null_values)

Our null values now:
 CRASH DATE                             0
CRASH TIME                             0
BOROUGH                                0
ZIP CODE                               0
LATITUDE                               0
LONGITUDE                              0
LOCATION                               0
ON STREET NAME                    321239
CROSS STREET NAME                 321778
OFF STREET NAME                  1107450
NUMBER OF PERSONS INJURED             11
NUMBER OF PERSONS KILLED              23
NUMBER OF PEDESTRIANS INJURED          0
NUMBER OF PEDESTRIANS KILLED           0
NUMBER OF CYCLIST INJURED              0
NUMBER OF CYCLIST KILLED               0
NUMBER OF MOTORIST INJURED             0
NUMBER OF MOTORIST KILLED              0
CONTRIBUTING FACTOR VEHICLE 1       5376
CONTRIBUTING FACTOR VEHICLE 2     236603
COLLISION_ID                           0
VEHICLE TYPE CODE 1                10403
VEHICLE TYPE CODE 2               290014
dtype: int64


## Removing unneeded columns
So we're definitely in a better spot now, but there is still much more that we can do. Firstly, we can see that those "STREET NAME" columns have a lot of null values in them. Since the name of the street is too granular to be useful for our regression equation, we can just get rid of those columns entirely. We can also see that every crash has a unique collision ID given to it by the NYPD. Again, this won't help us at all with regression, so we'll scrap it as well. Finally, the contributing factor columns are not standardized categorical columns. The data in there are manually entered strings, so we'll drop those columns too.

In [51]:
# Drop all of these columns in here
collisions.drop(['ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME', 'COLLISION_ID', \
                 'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2'], axis=1, inplace=True)
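One way to check whether a text column behaves like a clean categorical is to compare the number of distinct values before and after light normalization. This is only a sketch: the entries below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical free-text entries, not real values from the collisions data
factors = pd.Series([
    "Unspecified",
    "unspecified",
    "Driver Inattention/Distraction",
    "driver inattention/distraction ",
    "Backing Unsafely",
])

# Distinct values as entered
raw_levels = factors.nunique()

# Distinct values after trimming whitespace and lowercasing
normalized_levels = factors.str.strip().str.lower().nunique()

print(raw_levels, normalized_levels)
```

A large gap between the two counts would point to inconsistent manual entry; a small gap would suggest the column could be kept and encoded instead of dropped.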

In [53]:
# Let's see how we're doing now...
null_values = collisions.isnull().sum()
print("Our null values now:\n", null_values)

Our null values now:
 CRASH DATE                            0
CRASH TIME                            0
BOROUGH                               0
ZIP CODE                              0
LATITUDE                              0
LONGITUDE                             0
LOCATION                              0
NUMBER OF PERSONS INJURED            11
NUMBER OF PERSONS KILLED             23
NUMBER OF PEDESTRIANS INJURED         0
NUMBER OF PEDESTRIANS KILLED          0
NUMBER OF CYCLIST INJURED             0
NUMBER OF CYCLIST KILLED              0
NUMBER OF MOTORIST INJURED            0
NUMBER OF MOTORIST KILLED             0
CONTRIBUTING FACTOR VEHICLE 1      5376
CONTRIBUTING FACTOR VEHICLE 2    236603
VEHICLE TYPE CODE 1               10403
VEHICLE TYPE CODE 2              290014
dtype: int64


## Dropping NA's in the remaining rows/columns
We're doing much better now. We still have a few hundred thousand rows with at least one NA (VEHICLE TYPE CODE 2 alone is missing in 290,014 rows). For our purposes here, since we started with around 2 million rows, we can afford to lose that data, so we'll now drop any row with an NA.

In [55]:
collisions.dropna(how='any', axis=0, inplace=True)
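The per-column counts don't tell us directly how many rows will be lost, because a single row can be missing several fields. A minimal sketch of counting the affected rows before dropping them, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in: rows 1 and 2 each have at least one missing field
toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [0.0, np.nan, 1.0, 0.0],
    "VEHICLE TYPE CODE 1": ["Sedan", "Taxi", None, "Sedan"],
    "VEHICLE TYPE CODE 2": ["Sedan", None, None, "Bus"],
})

# Number of rows with at least one NA in any column
rows_with_na = toy.isna().any(axis=1).sum()

# Same call as in the notebook, but returning a new frame instead of inplace
cleaned = toy.dropna(how="any", axis=0)

print(rows_with_na, "rows with an NA;", len(cleaned), "rows kept")
```

Running `collisions.isna().any(axis=1).sum()` before the `dropna` call would give the exact number of rows removed from the real data.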

In [56]:
# Let's see how we're doing now...
null_values = collisions.isnull().sum()
print("Our null values now:\n", null_values)

Our null values now:
 CRASH DATE                       0
CRASH TIME                       0
BOROUGH                          0
ZIP CODE                         0
LATITUDE                         0
LONGITUDE                        0
LOCATION                         0
NUMBER OF PERSONS INJURED        0
NUMBER OF PERSONS KILLED         0
NUMBER OF PEDESTRIANS INJURED    0
NUMBER OF PEDESTRIANS KILLED     0
NUMBER OF CYCLIST INJURED        0
NUMBER OF CYCLIST KILLED         0
NUMBER OF MOTORIST INJURED       0
NUMBER OF MOTORIST KILLED        0
CONTRIBUTING FACTOR VEHICLE 1    0
CONTRIBUTING FACTOR VEHICLE 2    0
VEHICLE TYPE CODE 1              0
VEHICLE TYPE CODE 2              0
dtype: int64
