NYC Open Data hosts a vehicle crash dataset that contains over 1.8 million records of collisions across the five boroughs.

My package provides a simple function that connects to this API and, given a dictionary of parameters, extracts whatever subset of the dataset is needed. For example, the data can be filtered by date, time, borough, zip code, and number of injuries or deaths.
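
To illustrate how such a parameter dictionary might map onto an API request, here is a minimal sketch that builds Socrata-style query parameters (`$where`, `$limit`). It is not the package's actual implementation: the helper name `build_soql_params` is hypothetical, and the column name `number_of_persons_killed` and the choice to quote dates, times, and zip codes as text are assumptions about the dataset.

```python
def build_soql_params(params):
    """Translate a filter dictionary into Socrata query parameters (a sketch)."""
    # Map filter keys to dataset columns. "number_of_persons_killed" is an
    # assumed column name; the others appear in the output shown in this notebook.
    column_for = {
        "date": "crash_date",
        "time": "crash_time",
        "zip_code": "zip_code",
        "injury": "number_of_persons_injured",
        "death": "number_of_persons_killed",
    }
    # Assumption: these columns are compared as quoted text values.
    quoted = {"date", "time", "zip_code"}

    clauses = []
    for key, column in column_for.items():
        if key not in params:
            continue
        value = params[key]
        # A single value is treated as a range whose bounds are equal.
        low, high = value if isinstance(value, (list, tuple)) else (value, value)
        if key in quoted:
            low, high = f"'{low}'", f"'{high}'"
        clauses.append(f"{column} between {low} and {high}")

    if "borough" in params:
        names = ", ".join(f"'{b}'" for b in params["borough"])
        clauses.append(f"borough in ({names})")

    query = {"$limit": params.get("limit", 10000)}
    if clauses:
        query["$where"] = " AND ".join(clauses)
    return query


example = build_soql_params({
    "date": ["2021-01-01", "2021-01-15"],
    "borough": ["MANHATTAN"],
    "limit": 5,
})
print(example)
```

The resulting dictionary could then be passed as the `params` argument of an HTTP GET request against the dataset's JSON endpoint.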

The package can be found at https://test.pypi.org/project/final-project-jc5492/

In [20]:
!pip install -i https://test.pypi.org/pypi/ --extra-index-url https://pypi.org/simple final-project-jc5492

Looking in indexes: https://test.pypi.org/pypi/, https://pypi.org/simple


In [21]:
from final_project_jc5492 import final_project_jc5492 as final

In [22]:
help(final.get_crash_data)

Help on function get_crash_data in module final_project_jc5492.final_project_jc5492:

get_crash_data(params={'limit': 10000})
    Gets vehicle collision data from NYC open data API based on provided dictionary of parameters.
    
    Parameters
    ----------
    date : String
      Either a single date in "YYYY-MM-DD" format, or a list of [min date, max date]
    time : String
      Either a single time in "HH:MM" format, or a list of [min time, max time]
    zip_code : Integer
      Either a single 5 digit zip code, or a list of [min zip code, max zip code]
    borough : String
      List of boroughs to be included i.e. ["MANHATTAN","QUEENS","BRONX"]
    injury : Integer
      Either a single integer or a list of [min injuries, max injuries]
    death : Integer
      Either a single integer or a list of [min deaths, max deaths]
    limit : Integer 
      Maximum number of rows to return from API
    
    Returns
    --------
    Pandas DataFrame
      DataFrame of vehicle collisions 

In [31]:
default_params = {"date": ["2021-01-01", "2021-01-15"],
                  "time": ["09:00", "17:00"],
                  "borough": ["MANHATTAN"],
                  "injury": [0, 10],
                  "death": [0, 0],
                  "limit": 10000}

In [32]:
df = final.get_crash_data(default_params)
df.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,number_of_persons_injured,...,vehicle_type_code1,contributing_factor_vehicle_2,cross_street_name,vehicle_type_code2,contributing_factor_vehicle_3,vehicle_type_code_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,vehicle_type_code_4,vehicle_type_code_5
0,2021-01-01T00:00:00.000,1:02,MANHATTAN,10029,40.78744,-73.94478,"{'latitude': '40.78744', 'longitude': '-73.944...",2 AVENUE,EAST 101 STREET,2,...,Taxi,,,,,,,,,
1,2021-01-01T00:00:00.000,16:25,MANHATTAN,10018,40.759514,-73.99926,"{'latitude': '40.759514', 'longitude': '-73.99...",11 AVENUE,WEST 40 STREET,0,...,Sedan,Unspecified,,,,,,,,
2,2021-01-01T00:00:00.000,1:12,MANHATTAN,10019,40.76325,-73.989136,"{'latitude': '40.76325', 'longitude': '-73.989...",,,0,...,Station Wagon/Sport Utility Vehicle,Unspecified,733 9 AVENUE,Bike,,,,,,
3,2021-01-01T00:00:00.000,16:40,MANHATTAN,10034,40.864403,-73.923775,"{'latitude': '40.864403', 'longitude': '-73.92...",SHERMAN AVENUE,ACADEMY STREET,0,...,Station Wagon/Sport Utility Vehicle,Unspecified,,Sedan,,,,,,
4,2021-01-02T00:00:00.000,11:37,MANHATTAN,10001,40.752834,-74.004715,"{'latitude': '40.752834', 'longitude': '-74.00...",,,0,...,Sedan,,601 WEST 29 STREET,,,,,,,


One problem with the dataset is that not every observation has a location provided as latitude and longitude. To work around this, I've created a function that uses the geocoder package and OpenStreetMap to obtain the latitude and longitude for these observations based on the addresses found in "on_street_name", "off_street_name" and "cross_street_name".
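
As a rough sketch of this idea (not the code behind `geocode_missing_row`, whose internals are not shown here), the street fields can be combined into a single address string and passed to `geocoder.osm`. The helper names `build_address` and `geocode_address` are hypothetical, and appending the borough and a city suffix is an assumption intended to keep matches within New York City.

```python
def build_address(row, city="New York, NY"):
    """Compose a geocodable address from a crash record's street fields."""
    def present(value):
        # Missing values arrive as NaN (a float) or None, so only keep real text.
        return isinstance(value, str) and value.strip() != ""

    cross = row.get("cross_street_name")
    on = row.get("on_street_name")
    off = row.get("off_street_name")
    borough = row.get("borough")

    if present(cross):
        # In the sample output this field holds a street address, e.g. "733 9 AVENUE".
        street = cross.strip()
    elif present(on) and present(off):
        street = f"{on.strip()} and {off.strip()}"
    elif present(on):
        street = on.strip()
    else:
        return None  # nothing usable to geocode

    parts = [street]
    if present(borough):
        parts.append(borough.strip().title())
    parts.append(city)
    return ", ".join(parts)


def geocode_address(address):
    """Look up [latitude, longitude] with OpenStreetMap via the geocoder package."""
    import geocoder  # third-party; imported here so build_address works without it

    result = geocoder.osm(address)
    return result.latlng if result.ok else None


row = {"on_street_name": "12 AVENUE", "off_street_name": "WEST 51 STREET",
       "cross_street_name": float("nan"), "borough": "MANHATTAN"}
print(build_address(row))
```

Because `build_address` only relies on `.get`, it accepts either a plain dictionary or a pandas row, so it could be used with `DataFrame.apply(..., axis=1)`.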

In [33]:
# Rows with missing latitude/longitude
df[df['location'].isna()]

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,number_of_persons_injured,...,vehicle_type_code1,contributing_factor_vehicle_2,cross_street_name,vehicle_type_code2,contributing_factor_vehicle_3,vehicle_type_code_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,vehicle_type_code_4,vehicle_type_code_5
45,2021-01-07T00:00:00.000,10:44,MANHATTAN,10019,,,,12 AVENUE,WEST 51 STREET,1,...,Station Wagon/Sport Utility Vehicle,Unspecified,,Sedan,,,,,,
61,2021-01-08T00:00:00.000,15:00,MANHATTAN,10019,,,,WEST 56 STREET,12 AVENUE,1,...,Station Wagon/Sport Utility Vehicle,Unspecified,,Station Wagon/Sport Utility Vehicle,,,,,,
76,2021-01-10T00:00:00.000,11:35,MANHATTAN,10021,,,,EAST 73 STREET,FDR DRIVE,0,...,Sedan,Unspecified,,Station Wagon/Sport Utility Vehicle,,,,,,


In [34]:
df[df['location'].isna()].apply(final.geocode_missing_row, axis=1)

45                  [42.63981715, -73.7768841]
61    [40.763741350000004, -73.97900946431503]
76                   [40.7881752, -73.9385601]
dtype: object
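
The geocoded coordinates can then be written back into the DataFrame. The sketch below is one possible approach, not part of the package: `in_nyc` and `fill_missing_coordinates` are hypothetical helpers, and the bounding box is an approximate assumption about the extent of the five boroughs. The check is included because a geocoder can return a match outside the city, as with the latitude of about 42.64 returned for row 45 above.

```python
# Approximate bounding box for New York City (an assumption; adjust as needed).
NYC_BOUNDS = {"lat": (40.47, 40.92), "lng": (-74.26, -73.70)}


def in_nyc(latlng):
    """Return True if [lat, lng] falls inside the approximate NYC bounding box."""
    if not latlng or len(latlng) != 2:
        return False
    lat, lng = latlng
    lat_lo, lat_hi = NYC_BOUNDS["lat"]
    lng_lo, lng_hi = NYC_BOUNDS["lng"]
    return lat_lo <= lat <= lat_hi and lng_lo <= lng <= lng_hi


def fill_missing_coordinates(df, coords):
    """Copy plausible geocoded coordinates into a copy of the DataFrame.

    coords maps a row index to [lat, lng], e.g. the Series returned by
    df[...].apply(final.geocode_missing_row, axis=1).
    """
    filled = df.copy()
    for idx, latlng in coords.items():
        if in_nyc(latlng):  # skip failed or out-of-area matches
            filled.loc[idx, "latitude"] = latlng[0]
            filled.loc[idx, "longitude"] = latlng[1]
    return filled


print(in_nyc([40.7881752, -73.9385601]))
print(in_nyc([42.63981715, -73.7768841]))
```

With the results above, rows 61 and 76 would be filled in, while row 45 would be left missing for manual review.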