# Data cleaning
This notebook cleans the dataset [Motor Vehicle Collisions - Crashes](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) from the [NYC Open Data](https://opendata.cityofnewyork.us/) portal. The dataset contains information on all crashes, but for this project I am only interested in crashes involving bikes. The main part of the cleaning is thus to filter the dataset so that it only includes bike accidents.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/nyc_motor_vehicle_crashes.csv')

Only keep the accidents where a cyclist was either injured or killed. 

In [None]:
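# .copy() makes bike_df an independent DataFrame, so assigning to its columns later does not trigger SettingWithCopyWarning
bike_df = df[(df['NUMBER OF CYCLIST INJURED'] > 0) | (df['NUMBER OF CYCLIST KILLED'] > 0)].copy()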
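
To make the filter easier to check, here is a minimal sketch of the same boolean mask on a toy DataFrame. The values are made up for illustration; only the two column names come from the dataset.

```python
import pandas as pd

# Toy data with the same column names as the real dataset (values are invented).
toy = pd.DataFrame({
    'NUMBER OF CYCLIST INJURED': [0, 1, 0, 2],
    'NUMBER OF CYCLIST KILLED': [0, 0, 1, 0],
})

# A row is kept if either count is positive.
mask = (toy['NUMBER OF CYCLIST INJURED'] > 0) | (toy['NUMBER OF CYCLIST KILLED'] > 0)

# .copy() returns an independent frame rather than a slice of `toy`.
toy_bikes = toy[mask].copy()
kept_rows = toy_bikes.index.tolist()
```

Only the first row is dropped here, because it has no injured and no killed cyclist.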

Turn the crash date into a datetime object.

In [None]:
bike_df['CRASH DATE'] = pd.to_datetime(bike_df['CRASH DATE']) 
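
If the date strings follow a single known layout, passing it explicitly avoids format guessing. The sketch below assumes `MM/DD/YYYY`, which is an assumption about the export and should be checked against the file. The sample values are invented, and `errors='coerce'` is an optional choice, not part of the original cell.

```python
import pandas as pd

# Invented sample values; the last one is deliberately malformed.
sample = pd.DataFrame({'CRASH DATE': ['09/11/2021', '03/26/2022', 'not a date']})

# format= pins the expected layout (assumed MM/DD/YYYY);
# errors='coerce' turns unparseable values into NaT instead of raising.
sample['CRASH DATE'] = pd.to_datetime(
    sample['CRASH DATE'], format='%m/%d/%Y', errors='coerce'
)

# Count the values that could not be parsed.
n_bad = int(sample['CRASH DATE'].isna().sum())
first_month = int(sample['CRASH DATE'].dt.month.iloc[0])
```

Counting the `NaT` values afterwards gives a quick check on how many rows failed to parse.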

Next, remove the columns that are not relevant to the analysis.

In [None]:
bike_df = bike_df.drop(columns=[
    'NUMBER OF PEDESTRIANS INJURED',
    'NUMBER OF PEDESTRIANS KILLED',
    'NUMBER OF MOTORIST INJURED',
    'NUMBER OF MOTORIST KILLED',
])
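
As a quick sanity check, the drop can be tried on a one-row toy frame. This is only a sketch: the row is invented and just two of the four columns are included to keep it short.

```python
import pandas as pd

# One invented row with a mix of columns to keep and columns to drop.
toy = pd.DataFrame({
    'CRASH DATE': ['09/11/2021'],
    'NUMBER OF CYCLIST INJURED': [1],
    'NUMBER OF PEDESTRIANS INJURED': [0],
    'NUMBER OF MOTORIST KILLED': [0],
})

unused = ['NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF MOTORIST KILLED']

# drop(columns=...) returns a new frame; `toy` itself is left unchanged.
trimmed = toy.drop(columns=unused)
remaining = trimmed.columns.tolist()
```

`drop` raises a `KeyError` if a listed column is missing, which is a useful guard against misspelled column names.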

Lastly, the cleaned dataset is saved as a CSV file.

In [None]:
bike_df.to_csv('data/nyc_bike_accidents.csv')
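
Two details matter when the file is read back: `to_csv` writes the index as an unnamed first column by default, and CSV does not store dtypes, so the dates come back as strings unless they are parsed again. The sketch below shows a round trip on invented data, using an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# Invented rows; the non-consecutive index mimics a filtered DataFrame.
toy = pd.DataFrame(
    {
        'CRASH DATE': pd.to_datetime(['2021-09-11', '2022-03-26']),
        'NUMBER OF CYCLIST INJURED': [1, 2],
    },
    index=[5, 9],
)

# In-memory buffer used here instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)  # default behaviour: the index is written as the first column
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=['CRASH DATE'])
```

If the original row labels are not needed downstream, passing `index=False` to `to_csv` is an alternative that leaves the extra column out of the file.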