# NYC Motor Vehicle Collisions - Data Profiling and Data Cleaning
In the following we are going to profile and clean the NYC Motor Vehicle Collisions dataset, which contains collisions occured from 2012 to 2021. The dataset consists of over 1.8 million rows and the compressed data file is about 73 MB.
We will use [`OpenClean`](https://github.com/VIDA-NYU/openclean)  and [`geopy`](https://pypi.org/project/geopy/) to profile and clean the data.

Before we start, let us configure the environment
  
  `pip install openclean`

  `pip install openclean-geo`
  
  `pip install geopy`

  `pip install humanfriendly`

# Datasets and Streams
The identifier of the vehicle collisions dataset is `h9gi-nx95`. The following code downloads the dataset in tab-delimited CSV format and it will be stored in a local file called `h9gi-nx95.tsv.gz`.

In [1]:
# Download the full 'Motor Vehicle Collisions - Crashes' dataset.

import gzip
import humanfriendly
import os
from openclean.data.source.socrata import Socrata

dataset = Socrata().dataset('h9gi-nx95')

datafile = './h9gi-nx95.tsv.gz'

# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)

fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Using 'Motor Vehicle Collisions - Crashes' in file ./h9gi-nx95.tsv.gz of size 73.34 MB


In [2]:
# Open the downloaded dataset to extract the relevant columns and records.

from openclean.pipeline import stream
import pandas as pd
import numpy as np

datafile = './h9gi-nx95.tsv.gz'
ds = stream(datafile)
df_full = ds.to_df()

According to the entropy (we will calculate later) of each attribute, some data have little impact on the results, like CONTRIBUTING FACTOR VEHICLE 3-5 and VEHICLE TYPE CODE 3-5. So we will remove them.

In [3]:
# select the subset of columns

df = df_full[['CRASH DATE',
            'CRASH TIME',
            'BOROUGH',
            'ZIP CODE',
            'LATITUDE',
            'LONGITUDE',
            'LOCATION',
            'ON STREET NAME', 
            'CROSS STREET NAME',
            'OFF STREET NAME',
            'NUMBER OF PERSONS INJURED',
            'NUMBER OF PERSONS KILLED',
            'NUMBER OF PEDESTRIANS INJURED',
            'NUMBER OF PEDESTRIANS KILLED',
            'NUMBER OF CYCLIST INJURED',
            'NUMBER OF CYCLIST KILLED',
            'NUMBER OF MOTORIST INJURED',
            'NUMBER OF MOTORIST KILLED',
            'CONTRIBUTING FACTOR VEHICLE 1',
            'CONTRIBUTING FACTOR VEHICLE 2',
            'COLLISION_ID',
            'VEHICLE TYPE CODE 1',
            'VEHICLE TYPE CODE 2']]

# Data Profiling
We use the default column profiler from `openclean` to compute basic statistics as the number of empty values, distinct values, etc.

In [4]:
# Profile the resulting dataset view using the default data profiler
from openclean.profiling.dataset import dataset_profile

profile = dataset_profile(df)

In [5]:
# Print overview of profiling results

profile.stats()

Unnamed: 0,total,empty,distinct,uniqueness,entropy
CRASH DATE,1847986,0,3446,0.001865,11.68696
CRASH TIME,1847986,0,1440,0.000779,8.931117
BOROUGH,1847986,571101,5,4e-06,2.118388
ZIP CODE,1847986,571323,232,0.000182,7.221095
LATITUDE,1847986,215691,122520,0.07506,15.634063
LONGITUDE,1847986,215691,96066,0.058853,15.342907
LOCATION,1847986,215691,244758,0.149947,16.186366
ON STREET NAME,1847986,378148,16138,0.010979,10.60044
CROSS STREET NAME,1847986,665674,19279,0.016306,11.809362
OFF STREET NAME,1847986,1562207,180555,0.631799,16.921532


In [6]:
# Print data types for each column.
profile.types()

Unnamed: 0,date,float,int,str
CRASH DATE,3446,0,0,0
CRASH TIME,0,0,0,1440
BOROUGH,0,0,0,5
ZIP CODE,0,0,231,1
LATITUDE,0,122519,1,0
LONGITUDE,0,96064,2,0
LOCATION,0,0,0,244758
ON STREET NAME,87,0,16,16035
CROSS STREET NAME,2,1,27,19249
OFF STREET NAME,31,0,1,180523


According to the overview of profiling results and data types for each column, some problems can be very obvious, like empty data and error data. So,a simple check of the data of each columun is necessary before cleaning it.

## CRASH DATE

In [7]:
# Print the minimum and maximum value for column 'CRASH DATE'
profile.minmax('CRASH DATE')

Unnamed: 0,min,max
date,2012-07-01,2021-12-06


In [8]:
# Using the default settings yields two outliers.

from openclean.profiling.anomalies.sklearn import DBSCANOutliers

crashdate = ds.distinct('CRASH DATE')

errorList = DBSCANOutliers().find(crashdate)

In [9]:
if(len(errorList)!=0):
    print("deal!")
    df = df.drop(df['CRASH DATE'].isin(errorList).index)

## CRASH TIME

In [10]:
# Print the minimum and maximum value for column 'CRASH TIME'
profile.minmax('CRASH TIME')

Unnamed: 0,min,max
str,0:00,9:59


In [11]:
# Using the default settings yields two outliers.

from openclean.profiling.anomalies.sklearn import DBSCANOutliers

crashtime = ds.distinct('CRASH TIME')

errorList = DBSCANOutliers().find(crashtime)

In [12]:
errorList

[]

In [13]:
if(len(errorList)!=0):
    print("deal!")
    df = df.drop(df['CRASH TIME'].isin(errorList).index)

## BOROUGH

In [14]:
profile.minmax('BOROUGH')

Unnamed: 0,min,max
str,BRONX,STATEN ISLAND


In [15]:
# Get set of distinct values for column 'Registration State'. Print the
# values in decreasing order of frequency.

states = ds.distinct('BOROUGH')
for rank, val in enumerate(states.most_common()):
    st, freq = val
    print(f'{rank + 1:<3} {st}  {freq:>10,}')

1        571,101
2   BROOKLYN     402,084
3   QUEENS     342,038
4   MANHATTAN     293,418
5   BRONX     185,558
6   STATEN ISLAND      53,787


## ZIP CODE

In [16]:
profile.minmax('ZIP CODE')

Unnamed: 0,min,max
int,10000.0,11697.0
str,,


In [17]:
# Get set of distinct values for column 'Registration State'. Print the
# values in decreasing order of frequency.

states = ds.distinct('ZIP CODE')
for rank, val in enumerate(states.most_common()):
    st, freq = val
    print(f'{rank + 1:<3} {st}  {freq:>10,}')

1        571,323
2   11207      24,003
3   11101      17,324
4   11236      16,721
5   11203      16,188
6   10019      16,052
7   11385      15,853
8   11234      15,812
9   11201      15,547
10  10016      15,519
11  10036      15,427
12  11226      14,937
13  10022      14,906
14  10001      14,859
15  11212      14,851
16  11208      14,800
17  11434      14,731
18  10002      13,634
19  10013      13,521
20  11233      13,212
21  10467      12,733
22  11230      12,613
23  11206      12,524
24  11220      12,160
25  11211      12,055
26  11368      11,860
27  11373      11,394
28  11377      11,381
29  11354      11,378
30  11235      11,212
31  10018      11,136
32  11213      11,046
33  10466      10,781
34  11217      10,613
35  10458      10,420
36  11210      10,347
37  11223      10,265
38  11215      10,194
39  11432      10,130
40  10011      10,122
41  11355      10,078
42  11221       9,948
43  11372       9,831
44  11219       9,828
45  10457       9,703
46  10451      

In [18]:
# remove blank
df.loc[(df['ZIP CODE'].str.contains(' ', regex=False))] = ''

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [19]:
# New York zip code

zipcodes = ['10001', '10451', '10002', '10452', '10003', '10453', '10004', '10454', '10005', '10455', '10006',
               '10456', '10007', '10457', '10009', '10458', '10010', '10459', '10011', '10460', '10012', '10461',
               '10013', '10462', '10014', '10463', '10015', '10464', '10016', '10465', '10017', '10466', '10018',
               '10467', '10019', '10468', '10020', '10469', '10021', '10470', '10022', '10471', '10023', '10472',
               '10024', '10473', '10025', '10474', '10026', '10475', '10027', '11201', '10028', '11203', '10029',
               '11204', '10030', '11205', '10031', '11206', '10032', '11207', '10033', '11208', '10034', '11209',
               '10035', '11210', '10036', '11211', '10037', '11212', '10038', '11213', '10039', '11214', '10040',
               '11215', '10041', '11216', '10044', '11217', '10045', '11218', '10048', '11219', '10055', '11220',
               '10060', '11221', '10069', '11222', '10090', '11223', '10095', '11224', '10098', '11225', '10099',
               '11226', '10103', '11228', '10104', '11229', '10105', '11230', '10106', '11231', '10107', '11232',
               '10110', '11233', '10111', '11234', '10112', '11235', '10115', '11236', '10118', '11237', '10119',
               '11238', '10120', '11239', '10121', '11241', '10122', '11242', '10123', '11243', '10128', '11249',
               '10151', '11252', '10152', '11256', '10153', '10001', '10154', '10002', '10155', '10003', '10158',
               '10004', '10161', '10005', '10162', '10006', '10165', '10007', '10166', '10009', '10167', '10010',
               '10168', '10011', '10169', '10012', '10170', '10013', '10171', '10014', '10172', '10015', '10173',
               '10016', '10174', '10017', '10175', '10018', '10176', '10019', '10177', '10020', '10178', '10021',
               '10199', '10022', '10270', '10023', '10271', '10024', '10278', '10025', '10279', '10026', '10280',
               '10027', '10281', '10028', '10282', '10029', '10301', '10030', '10302', '10031', '10303', '10032',
               '10304', '10033', '10305', '10034', '10306', '10035', '10307', '10036', '10308', '10037', '10309',
               '10038', '10310', '10039', '10311', '10040', '10312', '10041', '10314', '10044', '10451', '10045',
               '10452', '10048', '10453', '10055', '10454', '10060', '10455', '10069', '10456', '10090', '10457',
               '10095', '10458', '10098', '10459', '10099', '10460', '10103', '10461', '10104', '10462', '10105',
               '10463', '10106', '10464', '10107', '10465', '10110', '10466', '10111', '10467', '10112', '10468',
               '10115', '10469', '10118', '10470', '10119', '10471', '10120', '10472', '10121', '10473', '10122',
               '10474', '10123', '10475', '10128', '11004', '10151', '11101', '10152', '11102', '10153', '11103',
               '10154', '11104', '10155', '11105', '10158', '11106', '10161', '11109', '10162', '11201', '10165',
               '11203', '10166', '11204', '10167', '11205', '10168', '11206', '10169', '11207', '10170', '11208',
               '10171', '11209', '10172', '11210', '10173', '11211', '10174', '11212', '10175', '11213', '10176',
               '11214', '10177', '11215', '10178', '11216', '10199', '11217', '10270', '11218', '10271', '11219',
               '10278', '11220', '10279', '11221', '10280', '11222', '10281', '11223', '10282', '11224', '11101',
               '11225', '11102', '11226', '11103', '11228', '11004', '11229', '11104', '11230', '11105', '11231',
               '11106', '11232', '11109', '11233', '11351', '11234', '11354', '11235', '11355', '11236', '11356',
               '11237', '11357', '11238', '11358', '11239', '11359', '11241', '11360', '11242', '11361', '11243',
               '11362', '11249', '11363', '11252', '11364', '11256', '11365', '11351', '11366', '11354', '11367',
               '11355', '11368', '11356', '11369', '11357', '11370', '11358', '11371', '11359', '11372', '11360',
               '11373', '11361', '11374', '11362', '11375', '11363', '11377', '11364', '11378', '11365', '11379',
               '11366', '11385', '11367', '11411', '11368', '11412', '11369', '11413', '11370', '11414', '11371',
               '11415', '11372', '11416', '11373', '11417', '11374', '11418', '11375', '11419', '11377', '11420',
               '11378', '11421', '11379', '11422', '11385', '11423', '11411', '11426', '11412', '11427', '11413',
               '11428', '11414', '11429', '11415', '11430', '11416', '11432', '11417', '11433', '11418', '11434',
               '11419', '11435', '11420', '11436', '11421', '11691', '11422', '11692', '11423', '11693', '11426',
               '11694', '11427', '11697', '11428', '10301', '11429', '10302', '11430', '10303', '11432', '10304',
               '11433', '10305', '11434', '10306', '11435', '10307', '11436', '10308', '11691', '10309', '11692',
               '10310', '11693', '10311', '11694', '10312', '11697', '10314', '10065', '10075', '11001', '11695',
               '11251','10179']

In [20]:
# remove not in new york zipcode
df = df.drop(df.loc[(~df['ZIP CODE'].isin(zipcodes)) &( df['ZIP CODE'] != '')].index)

## LOCATION

In [21]:
# remove 0
df['LATITUDE'] = df['LATITUDE'].replace('0', '', regex = True)
df['LONGITUDE'] = df['LONGITUDE'].replace('0', '', regex = True)
df['LOCATION'] = df['LOCATION'].replace('(0.0, 0.0)', '', regex = False)

# Data Cleaning



## Missing Geographic Information

As for motor collision data, geographic attributes are vital. This is also reflected in the entropy of each attribute. So, if all the vital information is missing, the record is useless, it will be deleted. Specifically, the row with empty BOROUGH, ZIP CODE, STREET NAME(ON/OFF/CROSS) and LOCATION(LATITUDE/ LONGITUDE) will be deleted.


In [22]:
# drop all records of lost geographical attributes
df = df.drop(df[(df['LOCATION'] == '') & \
                (df['ON STREET NAME'] == '') & \
                (df['OFF STREET NAME'] == '') &  \
                (df['CROSS STREET NAME'] == '') & \
                (df['BOROUGH'] == '') & \
                (df['ZIP CODE'] == '')].index)

For all the missing longtitude/latitude/on street name/cross street name/off street name, there will be difficult to calculate the location through Map API to navigate the other values

In [23]:
# drop all records that cannot calculate location 
df = df.drop(df[((df['LONGITUDE'] == '')  | (df['LONGITUDE'] == '0')) &\
                (df['ON STREET NAME'] == '') & (df['OFF STREET NAME'] == '') &  
                (df['CROSS STREET NAME'] == '')].index)

## Fill Geographic Information
For calculating other missing geological values, importing geopy library into the data cleaning process. With specific LONGITUDE and LATITUDE, using librabry to get zipcodes and address is farly easy and tidy

However, the problem of using Geocoding API is it has usage limits which only allows 50 requests per second; Ideally, if we process 50 records per second, it costs approximately 25 days to process the whole dataset which is not time efficient. We did not find a better way to solve this problem, so we only calculate 20 records for demostration.

In [28]:
# Using an open source library geopy to fill empty location
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="data-cleaning-project")

# according to location to gat all geographic cloumn
def find_ZIPCODE(x):
    location = geolocator.reverse(x['LOCATION'][1:-1])
    list = location.address.split(",")
    
    # get index of x
    dfindex = df.index
    condition = df['COLLISION_ID'] == x['COLLISION_ID']
    x_indices = dfindex[condition]
    x_indices_list = x_indices.tolist()
    index = x_indices_list[0]
    
    ZIPCODE = list[-2]
    BOROUGH = list[-5]
    OFFSTREETNAME = ','.join(list[0:-6])
    
    df.at[index,'BOROUGH'] = BOROUGH
    df.at[index,'ZIP CODE'] = ZIPCODE
    df.at[index,'OFF STREET NAME'] = OFFSTREETNAME

In [32]:
df.loc[df['LOCATION'] != ''].head(20).apply(find_ZIPCODE, axis=1)
print()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
1,04/13/2021,21:35,Brooklyn,11217,4.68358,-73.97617,"(40.68358, -73.97617)",,,"Atlantic Center, 625, Atlantic Avenue",...,0,0,0,0,0,Unspecified,,4407147,Sedan,
13,05/21/2019,22:50,Brooklyn,11201,4.69754,-73.98312,"(40.69754, -73.98312)",GOLD STREET,CONCORD STREET,"194, Concord Street",...,0,0,0,0,0,Passing or Lane Usage Improper,Unspecified,4136992,锟組BU,Taxi
15,02/26/2021,14:50,The Bronx,10461,4.843464,-73.836,"(40.843464, -73.836)",,,"2815, Middletown Road",...,0,0,0,0,0,Unspecified,Unspecified,4395664,Station Wagon/Sport Utility Vehicle,
16,03/09/2021,11:00,Brooklyn,11231,4.692547,-73.99974,"(40.692547, -73.990974)",COURT STREET,JORALEMON STREET,"Borough Hall (4,5), Court Street",...,0,0,0,1,0,Following Too Closely,Unspecified,4397513,Pick-up Truck,Sedan
17,03/31/2021,22:20,Brooklyn,11234,4.626457,-73.918,"(40.626457, -73.918)",RALPH AVENUE,AVENUE K,"5924, Avenue K",...,0,0,0,1,0,Driver Inexperience,Unspecified,4403773,Sedan,Sedan
18,04/06/2021,22:58,Staten Island,10312,4.526894,-74.16728,"(40.526894, -74.16728)",BARCLAY AVENUE,HYLAN BOULEVARD,4844,...,0,0,0,7,0,Failure to Yield Right-of-Way,Unsafe Speed,4405244,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle
19,04/09/2021,14:45,The Bronx,10460,4.84775,-73.87246,"(40.840775, -73.87246)",BRONX RIVER PARKWAY,,"Morris Park Avenue & Adams Street, Bronx River...",...,0,0,0,1,0,Driver Inattention/Distraction,Unspecified,4405914,Sedan,Station Wagon/Sport Utility Vehicle
21,04/14/2021,11:00,Queens,11411,4.69435,-73.72679,"(40.694035, -73.72679)",CROSS ISLAND PARKWAY,,,...,0,0,0,1,0,Brakes Defective,Unspecified,4407366,Station Wagon/Sport Utility Vehicle,Sedan
22,04/15/2021,13:30,The Bronx,10469,4.857365,-73.84657,"(40.857365, -73.84657)",,,Albert Einstein College of Medicine,...,0,0,0,0,0,Following Too Closely,Unspecified,4407778,Sedan,Ambulance
23,04/14/2021,14:40,Brooklyn,11237,4.69887,-73.91837,"(40.698807, -73.91837)",MYRTLE AVENUE,,"1422, Myrtle Avenue",...,0,0,0,0,0,Passing Too Closely,Unspecified,4407461,Box Truck,Sedan


In [33]:
def find_zipcode(location):
    address = geolocator.geocode(location)
    arr = address.raw['display_name'].split(', ')
    return arr[len(arr) - 2]

def find_borough(location):
    address = geolocator.geocode(location)
    arr = address.raw['display_name'].split(', ')
    return arr[len(arr) - 5]

def find_street(location):
    address = geolocator.geocode(location)
    arr = address.raw['display_name'].split(', ')
    street = ''
    for i in range(1, len(arr) - 5):
        street += arr[i] + ' '
    return street

def find_latlng(street):
    location = geolocator.geocode(street)
    try:
        lat = location.latitude
        lng = location.longitude
    except AttributeError:
        lat = 0
        lng = 0
    return (lat, lng)

def find_location(x):
    if x['ON STREET NAME'] != '':
        return find_latlng(x['ON STREET NAME'])
    elif x['CROSS STREET NAME'] != '':
        return find_latlng(x['CROSS STREET NAME'])
    elif x['OFF STREET NAME'] != '':
        return find_latlng(x['OFF STREET NAME'])
    else:
        return x['LOCATION']

In [34]:
# find location, latitude, langitude by street name (only)
df.loc[df['LOCATION'] == '', 'LOCATION'] = df.loc[df['LOCATION'] == ''].head(20).apply(find_location, axis=1)

## Uppercase
Standardizing all the String type value is clear for professionals to analyze in the future

In [35]:
# Change all words to upper case
df.apply(lambda x: x.astype(str).str.upper())

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
0,04/14/2021,5:32,,,,,"(40.80205005, -73.8297471344276)",BRONX WHITESTONE BRIDGE,,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407480,SEDAN,SEDAN
1,04/13/2021,21:35,BROOKLYN,11217,4.68358,-73.97617,"(40.68358, -73.97617)",,,"ATLANTIC CENTER, 625, ATLANTIC AVENUE",...,0,0,0,0,0,UNSPECIFIED,,4407147,SEDAN,
2,04/15/2021,16:15,,,,,"(40.849642349999996, -73.83640393750001)",HUTCHINSON RIVER PARKWAY,,,...,0,0,0,0,0,PAVEMENT SLIPPERY,,4407665,STATION WAGON/SPORT UTILITY VEHICLE,
3,04/13/2021,16:00,BROOKLYN,11222,,,"(42.083058, -76.05075)",VANDERVORT AVENUE,ANTHONY STREET,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407811,SEDAN,
4,04/12/2021,8:25,,,,,"(41.5475152, -73.0123417)",EDSON AVENUE,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,4406885,STATION WAGON/SPORT UTILITY VEHICLE,SEDAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1847981,07/06/2012,15:09,MANHATTAN,10035,4.812354,-73.9418153,"(40.8012354, -73.9418153)",EAST 119 STREET,PARK AVENUE,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,59654,SPORT UTILITY / STATION WAGON,PASSENGER VEHICLE
1847982,07/03/2012,17:30,QUEENS,11102,4.7747112,-73.9333863,"(40.7747112, -73.9333863)",27 AVENUE,4 STREET,,...,0,0,0,2,0,FAILURE TO YIELD RIGHT-OF-WAY,UNSPECIFIED,272592,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON
1847983,07/01/2012,15:30,BROOKLYN,11236,4.645318,-73.9199775,"(40.6450318, -73.9199775)",RALPH AVENUE,CLARENDON ROAD,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,135041,SMALL COM VEH(4 TIRES),PASSENGER VEHICLE
1847984,07/08/2012,18:30,,,4.7861217,-73.84782,"(40.7861217, -73.8040782)",,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,3055617,PASSENGER VEHICLE,PASSENGER VEHICLE


## Street Format
Normalize the street name to more proper format by using `StandardizeUSStreetName` from `openclean_geo`

In [36]:
# '158 st      '
df['ON STREET NAME'] = df['ON STREET NAME'].map(lambda x: x.strip())
df['OFF STREET NAME'] = df['OFF STREET NAME'].map(lambda x: x.strip())
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].map(lambda x: x.strip())
# '60-30      30'
df['ON STREET NAME'] = df['ON STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['OFF STREET NAME'] = df['OFF STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)

In [37]:
# Use street name standardization operator to modify street names
from openclean_geo.address.usstreet import StandardizeUSStreetName
f = StandardizeUSStreetName(characters='upper', alphanum=True, repeated=False)
df['ON STREET NAME'] = f.apply(df['ON STREET NAME'], threads=3)
df['CROSS STREET NAME'] = f.apply(df['CROSS STREET NAME'], threads=3)
df['OFF STREET NAME'] = f.apply(df['OFF STREET NAME'], threads=3)

In [38]:
# '158 st      '
df['ON STREET NAME'] = df['ON STREET NAME'].map(lambda x: x.strip())
df['OFF STREET NAME'] = df['OFF STREET NAME'].map(lambda x: x.strip())
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].map(lambda x: x.strip())
# '60-30      30'
df['ON STREET NAME'] = df['ON STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['OFF STREET NAME'] = df['OFF STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)

## Error  Data Type 

There are some rows with error data type，like int and date type in ON STREET NAME, CROSS STREET NAME, OFF STREET NAME. Regular expression is a good way to solve it.

In [39]:
# Replace the error type data like int, date
df['ON STREET NAME'] = df['ON STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['OFF STREET NAME'] = df['OFF STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)

## Missing Data
In the vehicle collision data set, there are some attributes that are missing important information. For example,  with missing value - CONTRIBUTING FACTOR VEHICLE 1-5, we can fill up with ‘UNSPECIFIED’. 


In [40]:
# Fill empty values with 'Unspecified'
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].replace('', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].replace('', 'UNSPECIFIED', regex = True)

As for VEHICLE TYPE CODE 1-2，we can fill up with ‘UNKNOWN’.

In [41]:
df['VEHICLE TYPE CODE 1'] = df['VEHICLE TYPE CODE 1'].replace('', 'UNKNOWN', regex = True)
df['VEHICLE TYPE CODE 2'] = df['VEHICLE TYPE CODE 2'].replace('', 'UNKNOWN', regex = True)

Also from Column 10-17, ‘NUMBER OF PERSONS INJURED’ to NUMBER OF MOTORIST INJURED, replacing the empty value in attributes NUMBER OF PERSONS INJURED/NUMBER OF PERSONS KILLED to 0.

In [42]:
# Replace empty values with '0'
df['NUMBER OF PERSONS INJURED'] = df['NUMBER OF PERSONS INJURED'].str.upper().replace('', '0', regex = True)
df['NUMBER OF PERSONS KILLED'] = df['NUMBER OF PERSONS KILLED'].str.upper().replace('', '0', regex = True)

## Type Error
There are so many type errors, like confusion of case, missing or adding characters and just only errors.

As for CONTRIBUTING FACTOR VEHICLE, some numbers are uselessful, so they will be replaced by “UNSPECIFIED”. 


In [43]:
# CONTRIBUTING FACTOR VEHICLE
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].str.upper().replace('ILLNES', 'ILLNESS', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].str.upper().replace('80', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].str.upper().replace('1', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].str.upper().replace('ILLNES', 'ILLNESS', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].str.upper().replace('80', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].str.upper().replace('1', 'UNSPECIFIED', regex = True)

VEHICLE TYPE CODE is more complex than CONTRIBUTING FACTOR VEHICLE. There are thousands of vehicle types and type errors. For example, just for the ambulance type, there are kinds of spelling, 'AMB', 'AMBU', 'AMBUKANCE', 'AMBUL', 'AMBULACE', 'AMBULANCE', 'AMBULANE', 'AMBULENCE', 'AMBULETTE', 'AMDU', 'AMUBULANCE', 'AMULANCE'. So applying [Standardizing Spellings](https://github.com/VIDA-NYU/openclean/blob/master/examples/notebooks/Standardization%20of%20Ethiopian%20Calendar%20and%20Woreda%20Names.ipynb) is necessary.


In [44]:
# Create a Matcher to match vehicle type
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.function.matching.base import DefaultStringMatcher

vehicle_type = set(['SEDAN','4 DR SEDAN','2 DR SEDAN','MOTORCYCL','TAXI',
                    'VAN','TRUCK','BUS','BIKE','MOTORCYCLE',
                    'STATION WAGON / SPORT UTILITY VEHICLE',
                    'LARGE COM VEH','SMALL COM VEH','OTHER',
                    'E-BIKE','E-SCOOTER','AMBULANCE','UNKNOWN',
                    'LIVERY VEHICLE','TRACTOR TRUCK DIESEL',
                    'CONVERTIBLE','DUMP','FDNY','USPS','TANK'])

matcher = DefaultStringMatcher(
            vocabulary = vehicle_type,
            similarity = FuzzySimilarity(),
            best_matches_only=True,
            no_match_threshold=0.2,
            cache_results = True)

def standardizeVehicleType(x):
    vtype = ""
    try:
        vtype = matcher.find_matches(x)[0].term
    except TypeError:
        vtype = "UNKNOWN"
    except IndexError:
        vtype = "UNKNOWN"
    return vtype

Applying the function consturcted above to apply to attributes VEHICLE TYPE CODE1 AND VEHICLE TYPE CODE 2

In [45]:
# Apply standardize vehicle type method on VEHICLE TYPE CODE 1 and VEHICLE TYPE CODE 2
df["VEHICLE TYPE CODE 1"] = df["VEHICLE TYPE CODE 1"].map(standardizeVehicleType)
df["VEHICLE TYPE CODE 2"] = df["VEHICLE TYPE CODE 2"].map(standardizeVehicleType)

In [46]:
df

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
0,04/14/2021,5:32,,,,,"(40.80205005, -73.8297471344276)",BRONX WHITESTONE BRG,,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407480,SEDAN,SEDAN
1,04/13/2021,21:35,Brooklyn,11217,4.68358,-73.97617,"(40.68358, -73.97617)",,,ATLANTIC CENTER 625 ATLANTIC AVE,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,4407147,SEDAN,UNKOWN
2,04/15/2021,16:15,,,,,"(40.849642349999996, -73.83640393750001)",HUTCHINSON RIVER PKWY,,,...,0,0,0,0,0,PAVEMENT SLIPPERY,UNSPECIFIED,4407665,STATION WAGON / SPORT UTILITY VEHICLE,UNKOWN
3,04/13/2021,16:00,BROOKLYN,11222,,,"(42.083058, -76.05075)",VANDERVORT AVE,ANTHONY ST,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407811,SEDAN,UNKOWN
4,04/12/2021,8:25,,,,,"(41.5475152, -73.0123417)",EDSON AVE,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,4406885,STATION WAGON / SPORT UTILITY VEHICLE,SEDAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1847981,07/06/2012,15:09,MANHATTAN,10035,4.812354,-73.9418153,"(40.8012354, -73.9418153)",EAST 119 ST,PARK AVE,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,59654,STATION WAGON / SPORT UTILITY VEHICLE,LIVERY VEHICLE
1847982,07/03/2012,17:30,QUEENS,11102,4.7747112,-73.9333863,"(40.7747112, -73.9333863)",27 AVE,4 ST,,...,0,0,0,2,0,FAILURE TO YIELD RIGHT-OF-WAY,UNSPECIFIED,272592,LIVERY VEHICLE,STATION WAGON / SPORT UTILITY VEHICLE
1847983,07/01/2012,15:30,BROOKLYN,11236,4.645318,-73.9199775,"(40.6450318, -73.9199775)",RALPH AVE,CLARENDON RD,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,135041,SMALL COM VEH,LIVERY VEHICLE
1847984,07/08/2012,18:30,,,4.7861217,-73.84782,"(40.7861217, -73.8040782)",,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,3055617,LIVERY VEHICLE,LIVERY VEHICLE


In [47]:
profile = dataset_profile(df)
profile.stats()



Unnamed: 0,total,empty,distinct,uniqueness,entropy
CRASH DATE,1816998,0,3446,0.001897,11.68755
CRASH TIME,1816998,0,1440,0.000793,8.928794
BOROUGH,1816998,541077,10,8e-06,2.118758
ZIP CODE,1816998,541299,246,0.000193,7.216658
LATITUDE,1816998,187752,95322,0.058507,15.289679
LONGITUDE,1816998,187752,77649,0.047659,14.945577
LOCATION,1816998,187732,244553,0.1501,16.191443
ON STREET NAME,1816998,350405,8912,0.006077,10.306346
CROSS STREET NAME,1816998,638339,9290,0.007882,10.810708
OFF STREET NAME,1816998,1532798,167426,0.589113,16.75037


In [48]:
profile.types()

Unnamed: 0,date,float,int,str,unknown
CRASH DATE,3446,0,0,0,0
CRASH TIME,0,0,0,1440,0
BOROUGH,0,0,0,10,0
ZIP CODE,0,0,246,0,0
LATITUDE,0,95322,0,0,0
LONGITUDE,0,77648,1,0,0
LOCATION,0,0,0,244536,17
ON STREET NAME,0,0,0,8912,0
CROSS STREET NAME,3,0,0,9287,0
OFF STREET NAME,1262,0,0,166164,0


# Output
Export the final result to a new csv file

In [49]:
# Export
compression_opts = dict(method='zip', archive_name='out.csv')  
df.to_csv('out.zip', index=False, compression=compression_opts)