### Introduction

My original motivation for exploring this dataset was a post from The New York Times that I saw on Instagram (link: https://www.instagram.com/p/C4KFCSELMr0/?utm_source=ig_web_copy_link&igsh=MzRlODBiNWFlZA%3D%3D). The post states that e-bikes are involved in more serious accidents than traditional bikes.

After reading the post, my immediate question was whether e-bikes are having more accidents than traditional bikes simply because e-bike usage has increased. The post did not provide this information, so I wanted to investigate further. I found a relevant dataset on cyclist crashes from NYC OpenData: https://data.cityofnewyork.us/Public-Safety/Crash-Cyclist-2020/2kbb-e72t/about_data. When I accessed the dataset, it was current as of March 26, 2024.

### Imports

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('Crash-Cyclist-2020_20240327.csv')

In [3]:
data

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,12/14/2021,12:54,BROOKLYN,11217.0,40.687534,-73.977500,"(40.687534, -73.9775)",FULTON STREET,SAINT FELIX STREET,,...,Unspecified,,,,4487052,Sedan,Bike,,,
1,12/14/2021,16:25,,,40.784615,-73.953964,"(40.784615, -73.953964)",EAST 93 STREET,,,...,Driver Inattention/Distraction,,,,4486581,Van,Bike,,,
2,04/24/2022,15:35,MANHATTAN,10019.0,40.767242,-73.986206,"(40.767242, -73.986206)",WEST 56 STREET,9 AVENUE,,...,Unspecified,,,,4521853,Station Wagon/Sport Utility Vehicle,Bike,,,
3,12/09/2021,20:20,BROOKLYN,11223.0,40.592070,-73.962990,"(40.59207, -73.96299)",EAST 7 STREET,CRAWFORD AVENUE,,...,Unspecified,,,,4485150,Bike,,,,
4,12/09/2021,23:15,BROOKLYN,11218.0,40.640835,-73.989670,"(40.640835, -73.98967)",12 AVENUE,41 STREET,,...,Driver Inattention/Distraction,,,,4485355,Sedan,Bike,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21096,02/14/2020,22:44,,,40.710907,-73.951650,"(40.710907, -73.95165)",BORINQUEN PLACE,,,...,Unspecified,Unspecified,Unspecified,Unspecified,4288986,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,E-Scooter,Sedan
21097,10/22/2022,8:55,BROOKLYN,11249.0,40.712860,-73.965790,"(40.71286, -73.96579)",SOUTH 4 STREET,WYTHE AVENUE,,...,Unspecified,,,,4575887,Station Wagon/Sport Utility Vehicle,Bike,,,
21098,11/06/2022,17:30,MANHATTAN,10022.0,40.762733,-73.973400,"(40.762733, -73.9734)",,,11 EAST 57 STREET,...,Driver Inattention/Distraction,,,,4580242,Taxi,Pedicab,,,
21099,09/21/2023,8:01,,,40.678818,-73.965350,"(40.678818, -73.96535)",BERGEN STREET,,,...,Passing or Lane Usage Improper,,,,4664391,Bus,Bike,,,


### Data Cleaning

Are spaces OK in column names? They are fine, but some pandas features, such as attribute-style column access, do not work with them.
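
As a small illustration of the point about spaces in column names, here is a sketch on a made-up one-row frame. The frame `demo` and the snake_case renaming are assumptions for illustration only; the notebook itself keeps the original column names.

```python
import pandas as pd

# Tiny stand-in frame: the column names mirror the crash dataset, the values are made up.
demo = pd.DataFrame({"CRASH DATE": ["12/14/2021"], "ZIP CODE": [11217.0]})

# Bracket access works with spaces; attribute access (demo.CRASH DATE) is a syntax error.
first_date = demo["CRASH DATE"][0]

# DataFrame.query needs backticks around names that contain spaces.
same_zip = demo.query("`ZIP CODE` == 11217")

# Optional: normalise names to snake_case so that attribute access works.
renamed = demo.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(list(renamed.columns))
```

Renaming is a matter of taste: bracket access with the original names works everywhere, so keeping the names as published is also reasonable.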

#### Observation 1
The dates are stored as strings rather than datetime objects, which makes them awkward to work with.

In [5]:
type(data['CRASH DATE'][0])

str

In [12]:
data['CRASH DATE'] = pd.to_datetime(data['CRASH DATE'])

In [11]:
data.sort_values(by='CRASH DATE')

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
13270,2020-01-02,20:42,MANHATTAN,10023.0,40.772243,-73.990005,"(40.772243, -73.990005)",WEST 60 STREET,WEST END AVENUE,,...,Unspecified,,,,4268270,Bike,,,,
13351,2020-01-02,17:35,BROOKLYN,11221.0,40.693874,-73.917770,"(40.693874, -73.91777)",CENTRAL AVENUE,GATES AVENUE,,...,Unspecified,,,,4268207,Station Wagon/Sport Utility Vehicle,Bike,,,
13301,2020-01-02,13:30,,,40.714165,-74.006320,"(40.714165, -74.00632)",CHAMBERS STREET,,,...,Unspecified,,,,4271563,Sedan,Bike,,,
13348,2020-01-02,16:00,MANHATTAN,10012.0,40.725643,-73.992070,"(40.725643, -73.99207)",BOWERY,EAST 2 STREET,,...,Unspecified,,,,4268322,Sedan,Bike,,,
13324,2020-01-02,21:30,MANHATTAN,10007.0,40.711933,-74.009850,"(40.711933, -74.00985)",,,30 VESEY STREET,...,Unspecified,,,,4271936,Sedan,Bike,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20732,2024-03-23,21:00,BROOKLYN,11201.0,40.692318,-73.989660,"(40.692318, -73.98966)",,,250 JORALEMON STREET,...,Unspecified,,,,4712033,Sedan,Bike,,,
20733,2024-03-23,8:20,BROOKLYN,11211.0,40.716840,-73.956640,"(40.71684, -73.95664)",,,559 DRIGGS AVENUE,...,Unspecified,,,,4712331,Station Wagon/Sport Utility Vehicle,Bike,,,
20736,2024-03-23,16:33,QUEENS,11106.0,40.760800,-73.939150,"(40.7608, -73.93915)",36 AVENUE,12 STREET,,...,Unspecified,,,,4711895,Station Wagon/Sport Utility Vehicle,E-Bike,,,
20737,2024-03-23,21:40,BROOKLYN,11207.0,40.680405,-73.887640,"(40.680405, -73.88764)",WARWICK STREET,ARLINGTON AVENUE,,...,Unspecified,,,,4712173,Sedan,E-Bike,,,
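
The conversion above lets pandas infer the date format. A minimal sketch of a stricter variant follows, assuming every row uses `MM/DD/YYYY` dates and `H:MM` times as in the rows shown. The frame `demo` and the combined `CRASH DATETIME` column are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Made-up rows in the same date and time layout as the CSV.
demo = pd.DataFrame({
    "CRASH DATE": ["10/22/2022", "12/14/2021"],
    "CRASH TIME": ["8:55", "12:54"],
})

# Combine date and time while both are still strings, then parse with an explicit format.
combined = demo["CRASH DATE"] + " " + demo["CRASH TIME"]
demo["CRASH DATETIME"] = pd.to_datetime(combined, format="%m/%d/%Y %H:%M")

# An explicit format raises on unexpected values instead of guessing.
demo["CRASH DATE"] = pd.to_datetime(demo["CRASH DATE"], format="%m/%d/%Y")

# sort_values returns a new frame; assign it if the sorted order should be kept.
by_time = demo.sort_values(by="CRASH DATETIME")
print(by_time)
```

Sorting on the combined column also orders crashes within the same day, which sorting on `CRASH DATE` alone does not.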


#### Observation 2
Missingness:
- Missing values in `BOROUGH` could in principle be recovered from `ON STREET NAME`, `CROSS STREET NAME`, or `OFF STREET NAME`. In practice this is complicated because one area contains so many roads. The same applies to `LATITUDE` and `LONGITUDE`.
- We have either `OFF STREET NAME` or `ON STREET NAME`, and may or may not have `CROSS STREET NAME`. `OFF STREET NAME` gives the exact address when known, whereas `ON STREET NAME` and `CROSS STREET NAME` describe a more general area.
- Values in `CONTRIBUTING FACTOR VEHICLE 2/3/4/5` are missing when the incident did not involve that many vehicles.
- I am not sure how the missingness of `CONTRIBUTING FACTOR VEHICLE` relates to the missingness of `VEHICLE TYPE`.
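
The observations above could be checked along these lines. This is only a sketch on made-up rows: the frame `demo` is a hypothetical stand-in for `data`, and comparing just the second vehicle's columns is an assumption about where to start.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the crash dataset.
demo = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", np.nan, "MANHATTAN", np.nan],
    "ON STREET NAME": ["FULTON STREET", "EAST 93 STREET", np.nan, "BERGEN STREET"],
    "OFF STREET NAME": [np.nan, np.nan, "11 EAST 57 STREET", np.nan],
    "CONTRIBUTING FACTOR VEHICLE 2": ["Unspecified", "Driver Inattention/Distraction", np.nan, "Unspecified"],
    "VEHICLE TYPE CODE 2": ["Bike", "Bike", np.nan, np.nan],
})

# Share of missing values per column.
missing_share = demo.isna().mean()

# True where exactly one of ON STREET NAME / OFF STREET NAME is present.
one_street = demo["ON STREET NAME"].notna() ^ demo["OFF STREET NAME"].notna()

# Cross-tabulate missingness of the second vehicle's factor against its type.
factor_vs_type = pd.crosstab(
    demo["CONTRIBUTING FACTOR VEHICLE 2"].isna(),
    demo["VEHICLE TYPE CODE 2"].isna(),
    rownames=["factor 2 missing"],
    colnames=["type 2 missing"],
)
print(missing_share)
print(factor_vs_type)
```

On the real data, large off-diagonal counts in the cross-tabulation would suggest that the two columns are not missing together, which is the open question in the last bullet.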