# Reproducibility evaluation

This reproducibility evaluation report is produced by group 14, which includes Shourie Sai(sk9247), Huimin Zhang(hz2466) and Xinyu Zhang(xz3369). We are trying to evaluate the job of group 20 and their github repository link is https://github.com/hd2327/bigdata-final-project.

### 1. Reproducibility of the Codes

###### (1) data_cleaning_improve.ipynb

This data-cleaning jupyter notebook works well. All dependencies were declared and we don't have to install new packages.

###### (2) data_analysis.zpln

The instructions for running this Zeppelin notebook are fairly good except for a small problem --file names are not matched. The name of the file they put on the HDFS Peel is "out.csv", but in the zeppelin file, the filename they used is "data_cleaning_output.csv". We change the filename in the command to “out.csv” and it works well.

![Image2](https://github.com/XinyuZhangXvX/BigDataFinalPorject/blob/master/02.png?raw=true)

The result we ran has a small difference with the result they ran, but the difference is acceptable because the dataset is real time updating so we ran more data than they previously did. 

###### (3) data_visualization.ipynb

This data-visualization jupyter notebook works well. All dependencies were declared and we don't have to install new packages.

### 2. Strengths and Weaknesses

###### （1）Fill Geographic Information

For this part, it's wonderful to use Geopy to fill in the missing values. But it's also mentioned in their notebook that the Geocoding API has usage limits of 50 requests per second, it will take too long to fix all the missing values from the datasets. The improvement they can make to overcome this limitation is:

a) Find if there is any API that can deal with bulk data processing, which means that thousands of geocodings will be processed when they call the API once.

b) Find any reference data to transform the geodatas directly instead of depending on the API to get geo information. For example, they can fill the boroughs with reference to zipcodes by a mapping from all the zipcodes in NYC to their corresponding boroughs, and they can fill the zipcodes with reference to the latitudes and longitudes by the geojson data of NYC. 

Another shortcoming is that they seem not to analyze the functional dependencies in this dataset. There might be some conflicts existing in the dataset, like one zipcode might match with two different boroughs.

###### （2）Typographical Errors

For this part, they did a good job and removed most of the typos. They converted the street names to upper case, then they removed leading, trailing spaces, whitespaces, and punctuation marks. They used the custom standardizer StandardizeUSStreetName for US street names to standardize type errors. When we applied clustering algorithm on the street columns, there were different representations of the same street names in the clusters, however they were fewer in number, they seemed to have reduced most of the typos.

![Image1](https://github.com/XinyuZhangXvX/BigDataFinalPorject/blob/master/01.PNG?raw=true)

###### (3) Output format

In their outputs of data_analysis.zpln, they wrongly print "\t" instead of a tab between boroughs and numbers in each string.This small problem should be corrected.

![Image3](https://github.com/XinyuZhangXvX/BigDataFinalPorject/blob/master/03.png?raw=true)

### 3. Their original notebook is given below:

-------

# NYC Motor Vehicle Collisions - Data Profiling and Data Cleaning
In the following we are going to profile and clean the NYC Motor Vehicle Collisions dataset, which contains collisions occured from 2012 to 2021. The dataset consists of over 1.8 million rows and the compressed data file is about 73 MB.
We will use [`OpenClean`](https://github.com/VIDA-NYU/openclean)  and [`geopy`](https://pypi.org/project/geopy/) to profile and clean the data.

Before we start, let us configure the environment
  
  `pip install openclean`

  `pip install openclean-geo`
  
  `pip install geopy`

  `pip install humanfriendly`

In [1]:
pip install openclean




In [2]:
pip install openclean-geo

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install humanfriendly

Note: you may need to restart the kernel to use updated packages.


# Datasets and Streams
The identifier of the vehicle collisions dataset is `h9gi-nx95`. The following code downloads the dataset in tab-delimited CSV format and it will be stored in a local file called `h9gi-nx95.tsv.gz`.

In [5]:
# Download the full 'Motor Vehicle Collisions - Crashes' dataset.

import gzip
import humanfriendly
import os
from openclean.data.source.socrata import Socrata

dataset = Socrata().dataset('h9gi-nx95')

datafile = './h9gi-nx95.tsv.gz'

# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)

fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Downloading ...

Using 'Motor Vehicle Collisions - Crashes' in file ./h9gi-nx95.tsv.gz of size 73.42 MB


In [6]:
# Open the downloaded dataset to extract the relevant columns and records.

from openclean.pipeline import stream
import pandas as pd
import numpy as np

datafile = './h9gi-nx95.tsv.gz'
ds = stream(datafile)
df_full = ds.to_df()

According to the entropy (we will calculate later) of each attribute, some data have little impact on the results, like CONTRIBUTING FACTOR VEHICLE 3-5 and VEHICLE TYPE CODE 3-5. So we will remove them.

In [7]:
# select the subset of columns

df = df_full[['CRASH DATE',
            'CRASH TIME',
            'BOROUGH',
            'ZIP CODE',
            'LATITUDE',
            'LONGITUDE',
            'LOCATION',
            'ON STREET NAME', 
            'CROSS STREET NAME',
            'OFF STREET NAME',
            'NUMBER OF PERSONS INJURED',
            'NUMBER OF PERSONS KILLED',
            'NUMBER OF PEDESTRIANS INJURED',
            'NUMBER OF PEDESTRIANS KILLED',
            'NUMBER OF CYCLIST INJURED',
            'NUMBER OF CYCLIST KILLED',
            'NUMBER OF MOTORIST INJURED',
            'NUMBER OF MOTORIST KILLED',
            'CONTRIBUTING FACTOR VEHICLE 1',
            'CONTRIBUTING FACTOR VEHICLE 2',
            'COLLISION_ID',
            'VEHICLE TYPE CODE 1',
            'VEHICLE TYPE CODE 2']]

# Data Profiling
We use the default column profiler from `openclean` to compute basic statistics as the number of empty values, distinct values, etc.

In [8]:
# Profile the resulting dataset view using the default data profiler
from openclean.profiling.dataset import dataset_profile

profile = dataset_profile(df)

In [9]:
# Print overview of profiling results

profile.stats()

Unnamed: 0,total,empty,distinct,uniqueness,entropy
CRASH DATE,1850023,0,3453,0.001866,11.689576
CRASH TIME,1850023,0,1440,0.000778,8.931776
BOROUGH,1850023,571771,5,4e-06,2.118396
ZIP CODE,1850023,571993,232,0.000182,7.221087
LATITUDE,1850023,215823,122570,0.075003,15.634731
LONGITUDE,1850023,215823,96081,0.058794,15.343019
LOCATION,1850023,215823,245180,0.150031,16.188122
ON STREET NAME,1850023,378699,16167,0.010988,10.604661
CROSS STREET NAME,1850023,666765,19282,0.016296,11.809651
OFF STREET NAME,1850023,1563693,181034,0.632256,16.925978


In [10]:
# Print data types for each column.
profile.types()

Unnamed: 0,date,float,int,str
CRASH DATE,3453,0,0,0
CRASH TIME,0,0,0,1440
BOROUGH,0,0,0,5
ZIP CODE,0,0,231,1
LATITUDE,0,122569,1,0
LONGITUDE,0,96079,2,0
LOCATION,0,0,0,245180
ON STREET NAME,87,0,16,16064
CROSS STREET NAME,2,1,28,19251
OFF STREET NAME,31,0,1,181002


According to the overview of profiling results and data types for each column, some problems can be very obvious, like empty data and error data. So,a simple check of the data of each columun is necessary before cleaning it.

## CRASH DATE

In [11]:
# Print the minimum and maximum value for column 'CRASH DATE'
profile.minmax('CRASH DATE')

Unnamed: 0,min,max
date,2012-07-01,2021-12-13


In [12]:
# Using the default settings yields two outliers.

from openclean.profiling.anomalies.sklearn import DBSCANOutliers

crashdate = ds.distinct('CRASH DATE')

errorList = DBSCANOutliers().find(crashdate)

In [13]:
if(len(errorList)!=0):
    print("deal!")
    df = df.drop(df['CRASH DATE'].isin(errorList).index)

## CRASH TIME

In [14]:
# Print the minimum and maximum value for column 'CRASH TIME'
profile.minmax('CRASH TIME')

Unnamed: 0,min,max
str,0:00,9:59


In [15]:
# Using the default settings yields two outliers.

from openclean.profiling.anomalies.sklearn import DBSCANOutliers

crashtime = ds.distinct('CRASH TIME')

errorList = DBSCANOutliers().find(crashtime)

In [16]:
errorList

[]

In [17]:
if(len(errorList)!=0):
    print("deal!")
    df = df.drop(df['CRASH TIME'].isin(errorList).index)

## BOROUGH

In [18]:
profile.minmax('BOROUGH')

Unnamed: 0,min,max
str,BRONX,STATEN ISLAND


In [19]:
# Get set of distinct values for column 'Registration State'. Print the
# values in decreasing order of frequency.

states = ds.distinct('BOROUGH')
for rank, val in enumerate(states.most_common()):
    st, freq = val
    print(f'{rank + 1:<3} {st}  {freq:>10,}')

1        571,771
2   BROOKLYN     402,550
3   QUEENS     342,400
4   MANHATTAN     293,647
5   BRONX     185,816
6   STATEN ISLAND      53,839


## ZIP CODE

In [20]:
profile.minmax('ZIP CODE')

Unnamed: 0,min,max
int,10000.0,11697.0
str,,


In [21]:
# Get set of distinct values for column 'Registration State'. Print the
# values in decreasing order of frequency.

states = ds.distinct('ZIP CODE')
for rank, val in enumerate(states.most_common()):
    st, freq = val
    print(f'{rank + 1:<3} {st}  {freq:>10,}')

1        571,993
2   11207      24,037
3   11101      17,344
4   11236      16,742
5   11203      16,207
6   10019      16,070
7   11385      15,876
8   11234      15,837
9   11201      15,559
10  10016      15,527
11  10036      15,434
12  11226      14,954
13  10022      14,917
14  10001      14,876
15  11212      14,873
16  11208      14,823
17  11434      14,745
18  10002      13,638
19  10013      13,532
20  11233      13,231
21  10467      12,754
22  11230      12,622
23  11206      12,530
24  11220      12,173
25  11211      12,075
26  11368      11,876
27  11373      11,407
28  11377      11,395
29  11354      11,385
30  11235      11,227
31  10018      11,141
32  11213      11,055
33  10466      10,795
34  11217      10,627
35  10458      10,429
36  11210      10,360
37  11223      10,278
38  11215      10,210
39  11432      10,136
40  10011      10,129
41  11355      10,083
42  11221       9,957
43  11372       9,840
44  11219       9,839
45  10457       9,718
46  10451      

In [22]:
# remove blank
df.loc[(df['ZIP CODE'].str.contains(' ', regex=False))] = ''

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [23]:
# New York zip code

zipcodes = ['10001', '10002', '10003', '10004', '10005', '10006', '10007', '10009', '10010', '10011', '10012', '10013',
            '10014', '10015', '10016', '10017', '10018', '10019', '10020', '10021', '10022', '10023', '10024', '10025',
            '10026', '10027', '10028', '10029', '10030', '10031', '10032', '10033', '10034', '10035', '10036', '10037',
            '10038', '10039', '10040', '10041', '10044', '10045', '10048', '10055', '10060', '10065', '10069', '10075',
            '10090', '10095', '10098', '10099', '10103', '10104', '10105', '10106', '10107', '10110', '10111', '10112',
            '10115', '10118', '10119', '10120', '10121', '10122', '10123', '10128', '10151', '10152', '10153', '10154',
            '10155', '10158', '10161', '10162', '10165', '10166', '10167', '10168', '10169', '10170', '10171', '10172',
            '10173', '10174', '10175', '10176', '10177', '10178', '10179', '10199', '10270', '10271', '10278', '10279',
            '10280', '10281', '10282', '10301', '10302', '10303', '10304', '10305', '10306', '10307', '10308', '10309',
            '10310', '10311', '10312', '10314', '10451', '10452', '10453', '10454', '10455', '10456', '10457', '10458',
            '10459', '10460', '10461', '10462', '10463', '10464', '10465', '10466', '10467', '10468', '10469', '10470',
            '10471', '10472', '10473', '10474', '10475', '11001', '11004', '11101', '11102', '11103', '11104', '11105',
            '11106', '11109', '11201', '11203', '11204', '11205', '11206', '11207', '11208', '11209', '11210', '11211',
            '11212', '11213', '11214', '11215', '11216', '11217', '11218', '11219', '11220', '11221', '11222', '11223',
            '11224', '11225', '11226', '11228', '11229', '11230', '11231', '11232', '11233', '11234', '11235', '11236',
            '11237', '11238', '11239', '11241', '11242', '11243', '11249', '11251', '11252', '11256', '11351', '11354',
            '11355', '11356', '11357', '11358', '11359', '11360', '11361', '11362', '11363', '11364', '11365', '11366',
            '11367', '11368', '11369', '11370', '11371', '11372', '11373', '11374', '11375', '11377', '11378', '11379',
            '11385', '11411', '11412', '11413', '11414', '11415', '11416', '11417', '11418', '11419', '11420', '11421',
            '11422', '11423', '11426', '11427', '11428', '11429', '11430', '11432', '11433', '11434', '11435', '11436',
            '11691', '11692', '11693', '11694', '11695', '11697']

In [24]:
# remove not in new york zipcode
df = df.drop(df.loc[(~df['ZIP CODE'].isin(zipcodes)) &( df['ZIP CODE'] != '')].index)

## LOCATION

In [25]:
# remove 0
df['LATITUDE'] = df['LATITUDE'].replace('0', '', regex = True)
df['LONGITUDE'] = df['LONGITUDE'].replace('0', '', regex = True)
df['LOCATION'] = df['LOCATION'].replace('(0.0, 0.0)', '', regex = False)

# Data Cleaning



## Missing Geographic Information

As for motor collision data, geographic attributes are vital. This is also reflected in the entropy of each attribute. So, if all the vital information is missing, the record is useless, it will be deleted. Specifically, the row with empty BOROUGH, ZIP CODE, STREET NAME(ON/OFF/CROSS) and LOCATION(LATITUDE/ LONGITUDE) will be deleted.


In [26]:
# drop all records of lost geographical attributes
df = df.drop(df[(df['LOCATION'] == '') & \
                (df['ON STREET NAME'] == '') & \
                (df['OFF STREET NAME'] == '') &  \
                (df['CROSS STREET NAME'] == '') & \
                (df['BOROUGH'] == '') & \
                (df['ZIP CODE'] == '')].index)

For all the missing longtitude/latitude/on street name/cross street name/off street name, there will be difficult to calculate the location through Map API to navigate the other values

In [27]:
# drop all records that cannot calculate location 
df = df.drop(df[((df['LONGITUDE'] == '')  | (df['LONGITUDE'] == '0')) &\
                (df['ON STREET NAME'] == '') & (df['OFF STREET NAME'] == '') &  
                (df['CROSS STREET NAME'] == '')].index)

## Fill Geographic Information
For calculating other missing geological values, importing geopy library into the data cleaning process. With specific LONGITUDE and LATITUDE, using librabry to get zipcodes and address is farly easy and tidy

However, the problem of using Geocoding API is it has usage limits which only allows 50 requests per second; Ideally, if we process 50 records per second, it costs approximately 25 days to process the whole dataset which is not time efficient. We did not find a better way to solve this problem, so we only calculate 20 records for demostration.

In [28]:
# Using an open source library geopy to fill empty location
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="data-cleaning-project")

# according to location to gat all geographic cloumn
def find_ZIPCODE(x):
    location = geolocator.reverse(x['LOCATION'][1:-1])
    list = location.address.split(",")
    
    # get index of x
    dfindex = df.index
    condition = df['COLLISION_ID'] == x['COLLISION_ID']
    x_indices = dfindex[condition]
    x_indices_list = x_indices.tolist()
    index = x_indices_list[0]
    
    ZIPCODE = list[-2]
    BOROUGH = list[-5]
    OFFSTREETNAME = ','.join(list[0:-6])
    
    df.at[index,'BOROUGH'] = BOROUGH
    df.at[index,'ZIP CODE'] = ZIPCODE
    df.at[index,'OFF STREET NAME'] = OFFSTREETNAME

In [29]:
df.loc[df['LOCATION'] != ''].head(20).apply(find_ZIPCODE, axis=1)
print()




In [30]:
def find_zipcode(location):
    address = geolocator.geocode(location)
    arr = address.raw['display_name'].split(', ')
    return arr[len(arr) - 2]

def find_borough(location):
    address = geolocator.geocode(location)
    arr = address.raw['display_name'].split(', ')
    return arr[len(arr) - 5]

def find_street(location):
    address = geolocator.geocode(location)
    arr = address.raw['display_name'].split(', ')
    street = ''
    for i in range(1, len(arr) - 5):
        street += arr[i] + ' '
    return street

def find_latlng(street):
    location = geolocator.geocode(street)
    try:
        lat = location.latitude
        lng = location.longitude
    except AttributeError:
        lat = 0
        lng = 0
    return (lat, lng)

def find_location(x):
    if x['ON STREET NAME'] != '':
        return find_latlng(x['ON STREET NAME'])
    elif x['CROSS STREET NAME'] != '':
        return find_latlng(x['CROSS STREET NAME'])
    elif x['OFF STREET NAME'] != '':
        return find_latlng(x['OFF STREET NAME'])
    else:
        return x['LOCATION']

In [31]:
# find location, latitude, langitude by street name (only)
df.loc[df['LOCATION'] == '', 'LOCATION'] = df.loc[df['LOCATION'] == ''].head(20).apply(find_location, axis=1)

## Uppercase
Standardizing all the String type value is clear for professionals to analyze in the future

In [32]:
# Change all words to upper case
df.apply(lambda x: x.astype(str).str.upper())

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
0,04/14/2021,5:32,,,,,"(40.80205005, -73.8297471344276)",BRONX WHITESTONE BRIDGE,,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407480,SEDAN,SEDAN
1,04/13/2021,21:35,BROOKLYN,11217,4.68358,-73.97617,"(40.68358, -73.97617)",,,"ATLANTIC CENTER, 625, ATLANTIC AVENUE",...,0,0,0,0,0,UNSPECIFIED,,4407147,SEDAN,
2,04/15/2021,16:15,,,,,"(40.849642349999996, -73.83640393750001)",HUTCHINSON RIVER PARKWAY,,,...,0,0,0,0,0,PAVEMENT SLIPPERY,,4407665,STATION WAGON/SPORT UTILITY VEHICLE,
3,04/13/2021,16:00,BROOKLYN,11222,,,"(42.083058, -76.05075)",VANDERVORT AVENUE,ANTHONY STREET,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407811,SEDAN,
4,04/12/2021,8:25,,,,,"(41.5475152, -73.0123417)",EDSON AVENUE,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,4406885,STATION WAGON/SPORT UTILITY VEHICLE,SEDAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1850018,07/06/2012,15:09,MANHATTAN,10035,4.812354,-73.9418153,"(40.8012354, -73.9418153)",EAST 119 STREET,PARK AVENUE,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,59654,SPORT UTILITY / STATION WAGON,PASSENGER VEHICLE
1850019,07/03/2012,17:30,QUEENS,11102,4.7747112,-73.9333863,"(40.7747112, -73.9333863)",27 AVENUE,4 STREET,,...,0,0,0,2,0,FAILURE TO YIELD RIGHT-OF-WAY,UNSPECIFIED,272592,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON
1850020,07/01/2012,15:30,BROOKLYN,11236,4.645318,-73.9199775,"(40.6450318, -73.9199775)",RALPH AVENUE,CLARENDON ROAD,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,135041,SMALL COM VEH(4 TIRES),PASSENGER VEHICLE
1850021,07/08/2012,18:30,,,4.7861217,-73.84782,"(40.7861217, -73.8040782)",,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,3055617,PASSENGER VEHICLE,PASSENGER VEHICLE


## Street Format
Normalize the street name to more proper format by using `StandardizeUSStreetName` from `openclean_geo`

In [33]:
# '158 st      '
df['ON STREET NAME'] = df['ON STREET NAME'].map(lambda x: x.strip())
df['OFF STREET NAME'] = df['OFF STREET NAME'].map(lambda x: x.strip())
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].map(lambda x: x.strip())
# '60-30      30'
df['ON STREET NAME'] = df['ON STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['OFF STREET NAME'] = df['OFF STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)

In [34]:
# Use street name standardization operator to modify street names
from openclean_geo.address.usstreet import StandardizeUSStreetName
f = StandardizeUSStreetName(characters='upper', alphanum=True, repeated=False)
df['ON STREET NAME'] = f.apply(df['ON STREET NAME'], threads=3)
df['CROSS STREET NAME'] = f.apply(df['CROSS STREET NAME'], threads=3)
df['OFF STREET NAME'] = f.apply(df['OFF STREET NAME'], threads=3)

In [35]:
# '158 st      '
df['ON STREET NAME'] = df['ON STREET NAME'].map(lambda x: x.strip())
df['OFF STREET NAME'] = df['OFF STREET NAME'].map(lambda x: x.strip())
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].map(lambda x: x.strip())
# '60-30      30'
df['ON STREET NAME'] = df['ON STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['OFF STREET NAME'] = df['OFF STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)

## Error  Data Type 

There are some rows with error data type，like int and date type in ON STREET NAME, CROSS STREET NAME, OFF STREET NAME. Regular expression is a good way to solve it.

In [36]:
# Replace the error type data like int, date
df['ON STREET NAME'] = df['ON STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['CROSS STREET NAME'] = df['CROSS STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)
df['OFF STREET NAME'] = df['OFF STREET NAME'].str.replace('^[a-z\d\-_\s]+$', '', regex = True)

## Missing Data
In the vehicle collision data set, there are some attributes that are missing important information. For example,  with missing value - CONTRIBUTING FACTOR VEHICLE 1-5, we can fill up with ‘UNSPECIFIED’. 


In [37]:
# Fill empty values with 'Unspecified'
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].replace('', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].replace('', 'UNSPECIFIED', regex = True)

As for VEHICLE TYPE CODE 1-2，we can fill up with ‘UNKNOWN’.

In [38]:
df['VEHICLE TYPE CODE 1'] = df['VEHICLE TYPE CODE 1'].replace('', 'UNKNOWN', regex = True)
df['VEHICLE TYPE CODE 2'] = df['VEHICLE TYPE CODE 2'].replace('', 'UNKNOWN', regex = True)

Also from Column 10-17, ‘NUMBER OF PERSONS INJURED’ to NUMBER OF MOTORIST INJURED, replacing the empty value in attributes NUMBER OF PERSONS INJURED/NUMBER OF PERSONS KILLED to 0.

In [39]:
# Replace empty values with '0'
df['NUMBER OF PERSONS INJURED'] = df['NUMBER OF PERSONS INJURED'].str.upper().replace('', '0', regex = True)
df['NUMBER OF PERSONS KILLED'] = df['NUMBER OF PERSONS KILLED'].str.upper().replace('', '0', regex = True)

## Type Error
There are so many type errors, like confusion of case, missing or adding characters and just only errors.

As for CONTRIBUTING FACTOR VEHICLE, some numbers are uselessful, so they will be replaced by “UNSPECIFIED”. 


In [40]:
# CONTRIBUTING FACTOR VEHICLE
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].str.upper().replace('ILLNES', 'ILLNESS', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].str.upper().replace('80', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 1'] = df['CONTRIBUTING FACTOR VEHICLE 1'].str.upper().replace('1', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].str.upper().replace('ILLNES', 'ILLNESS', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].str.upper().replace('80', 'UNSPECIFIED', regex = True)
df['CONTRIBUTING FACTOR VEHICLE 2'] = df['CONTRIBUTING FACTOR VEHICLE 2'].str.upper().replace('1', 'UNSPECIFIED', regex = True)

VEHICLE TYPE CODE is more complex than CONTRIBUTING FACTOR VEHICLE. There are thousands of vehicle types and type errors. For example, just for the ambulance type, there are kinds of spelling, 'AMB', 'AMBU', 'AMBUKANCE', 'AMBUL', 'AMBULACE', 'AMBULANCE', 'AMBULANE', 'AMBULENCE', 'AMBULETTE', 'AMDU', 'AMUBULANCE', 'AMULANCE'. So applying [Standardizing Spellings](https://github.com/VIDA-NYU/openclean/blob/master/examples/notebooks/Standardization%20of%20Ethiopian%20Calendar%20and%20Woreda%20Names.ipynb) is necessary.


In [41]:
# Create a Matcher to match vehicle type
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.function.matching.base import DefaultStringMatcher

vehicle_type = set(['SEDAN','4 DR SEDAN','2 DR SEDAN','MOTORCYCL','TAXI',
                    'VAN','TRUCK','BUS','BIKE','MOTORCYCLE',
                    'STATION WAGON / SPORT UTILITY VEHICLE',
                    'LARGE COM VEH','SMALL COM VEH','OTHER',
                    'E-BIKE','E-SCOOTER','AMBULANCE','UNKNOWN',
                    'LIVERY VEHICLE','TRACTOR TRUCK DIESEL',
                    'CONVERTIBLE','DUMP','FDNY','USPS','TANK'])

matcher = DefaultStringMatcher(
            vocabulary = vehicle_type,
            similarity = FuzzySimilarity(),
            best_matches_only=True,
            no_match_threshold=0.2,
            cache_results = True)

def standardizeVehicleType(x):
    vtype = ""
    try:
        vtype = matcher.find_matches(x)[0].term
    except TypeError:
        vtype = "UNKNOWN"
    except IndexError:
        vtype = "UNKNOWN"
    return vtype

Applying the function consturcted above to apply to attributes VEHICLE TYPE CODE1 AND VEHICLE TYPE CODE 2

In [42]:
# Apply standardize vehicle type method on VEHICLE TYPE CODE 1 and VEHICLE TYPE CODE 2
df["VEHICLE TYPE CODE 1"] = df["VEHICLE TYPE CODE 1"].map(standardizeVehicleType)
df["VEHICLE TYPE CODE 2"] = df["VEHICLE TYPE CODE 2"].map(standardizeVehicleType)

In [43]:
df['BOROUGH'] = df['BOROUGH'].str.upper().replace('THE BRONX', 'BRONX', regex = True)
df['BOROUGH'] = df['BOROUGH'].map(lambda x: x.strip())

In [44]:
df

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2
0,04/14/2021,5:32,,,,,"(40.80205005, -73.8297471344276)",BRONX WHITESTONE BRG,,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407480,SEDAN,SEDAN
1,04/13/2021,21:35,BROOKLYN,11217,4.68358,-73.97617,"(40.68358, -73.97617)",,,ATLANTIC CENTER 625 ATLANTIC AVE,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,4407147,SEDAN,UNKNOWN
2,04/15/2021,16:15,,,,,"(40.849642349999996, -73.83640393750001)",HUTCHINSON RIVER PKWY,,,...,0,0,0,0,0,PAVEMENT SLIPPERY,UNSPECIFIED,4407665,STATION WAGON / SPORT UTILITY VEHICLE,UNKNOWN
3,04/13/2021,16:00,BROOKLYN,11222,,,"(42.083058, -76.05075)",VANDERVORT AVE,ANTHONY ST,,...,0,0,0,0,0,FOLLOWING TOO CLOSELY,UNSPECIFIED,4407811,SEDAN,UNKNOWN
4,04/12/2021,8:25,,,,,"(41.5475152, -73.0123417)",EDSON AVE,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,4406885,STATION WAGON / SPORT UTILITY VEHICLE,SEDAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1850018,07/06/2012,15:09,MANHATTAN,10035,4.812354,-73.9418153,"(40.8012354, -73.9418153)",EAST 119 ST,PARK AVE,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,59654,STATION WAGON / SPORT UTILITY VEHICLE,LIVERY VEHICLE
1850019,07/03/2012,17:30,QUEENS,11102,4.7747112,-73.9333863,"(40.7747112, -73.9333863)",27 AVE,4 ST,,...,0,0,0,2,0,FAILURE TO YIELD RIGHT-OF-WAY,UNSPECIFIED,272592,LIVERY VEHICLE,STATION WAGON / SPORT UTILITY VEHICLE
1850020,07/01/2012,15:30,BROOKLYN,11236,4.645318,-73.9199775,"(40.6450318, -73.9199775)",RALPH AVE,CLARENDON RD,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,135041,SMALL COM VEH,LIVERY VEHICLE
1850021,07/08/2012,18:30,,,4.7861217,-73.84782,"(40.7861217, -73.8040782)",,,,...,0,0,0,0,0,UNSPECIFIED,UNSPECIFIED,3055617,LIVERY VEHICLE,LIVERY VEHICLE


In [45]:
profile = dataset_profile(df)
profile.stats()



Unnamed: 0,total,empty,distinct,uniqueness,entropy
CRASH DATE,1819034,0,3453,0.001898,11.690182
CRASH TIME,1819034,0,1440,0.000792,8.929469
BOROUGH,1819034,541747,5,4e-06,2.118496
ZIP CODE,1819034,541969,246,0.000193,7.216651
LATITUDE,1819034,187897,95351,0.058457,15.29003
LONGITUDE,1819034,187897,77655,0.047608,14.9454
LOCATION,1819034,187877,244974,0.150184,16.193242
ON STREET NAME,1819034,350959,8914,0.006072,10.306203
CROSS STREET NAME,1819034,639437,9290,0.007876,10.810802
OFF STREET NAME,1819034,1534285,167692,0.588912,16.752503


In [46]:
profile.types()

Unnamed: 0,date,float,int,str,unknown
CRASH DATE,3453,0,0,0,0
CRASH TIME,0,0,0,1440,0
BOROUGH,0,0,0,5,0
ZIP CODE,0,0,246,0,0
LATITUDE,0,95351,0,0,0
LONGITUDE,0,77654,1,0,0
LOCATION,0,0,0,244957,17
ON STREET NAME,0,0,0,8914,0
CROSS STREET NAME,3,0,0,9287,0
OFF STREET NAME,1265,0,0,166427,0


# Output
Export the final result to a new csv file

In [47]:
# Export
compression_opts = dict(method='zip', archive_name='out.csv')  
df.to_csv('out.zip', index=False, compression=compression_opts)