# PART 1: SOURCING AND CLEANING DATA

In this notebook, I'll create 3 dataframes to analyze NYC 311 complaints from 2017. The source datasets are the following:

1. NYC 311 Service Request Data
2. 2010 Census Population Data
3. Zip codes for NYC boroughs

In [116]:
import os
import sys
import time
import requests
import csv

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt

from io import StringIO

In [117]:
def printTimeElapsed(start):
    elapsed = time.time() - start
    print(time.strftime("%M:%S", time.gmtime(elapsed)))

## NYC 311 Service Request Data
First, I'll probe the NYC 311 database a few times to see what I'm working with. Much of this exploration has been removed for readability.

In [118]:
# Get application token from local file, if available

try:
    with open('api_nycOpenData.csv') as f:
        reader = csv.reader(f)
        api_vals = next(reader)
    HEADERS = {api_vals[0] : api_vals[1]}
    print('File found')
    
except (OSError, StopIteration, IndexError):  # missing, empty, or malformed token file
    print('File not found')
    HEADERS = {}

File found


In [119]:
pd.set_option('display.max_columns', 30)

api_URL = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"

# Start off with a simple call of 2019 complaints to get a feel for the dataset
PARAMS = {'$where': 'date_extract_y(created_date)=2019'}

resp = requests.get(api_URL, headers=HEADERS, params=PARAMS)
csv_IO = StringIO(resp.text)
resp_df = pd.read_csv(csv_IO)
resp_df

Unnamed: 0,address_type,agency,agency_name,bbl,borough,bridge_highway_direction,bridge_highway_name,bridge_highway_segment,city,closed_date,community_board,complaint_type,created_date,cross_street_1,cross_street_2,...,longitude,open_data_channel_type,park_borough,park_facility_name,resolution_action_updated_date,resolution_description,road_ramp,status,street_name,taxi_company_borough,taxi_pick_up_location,unique_key,vehicle_type,x_coordinate_state_plane,y_coordinate_state_plane
0,ADDRESS,HPD,Department of Housing Preservation and Develop...,4.021130e+09,QUEENS,,,,Rego Park,2019-02-06T21:43:59.000,06 QUEENS,HEAT/HOT WATER,2019-02-03T14:32:26.000,,,...,-73.853972,ONLINE,QUEENS,Unspecified,2019-02-06T21:43:59.000,The Department of Housing Preservation and Dev...,,Closed,64 ROAD,,,41631768,,1024722.0,205591.0
1,ADDRESS,HPD,Department of Housing Preservation and Develop...,3.020070e+09,BROOKLYN,,,,BROOKLYN,2019-02-06T02:08:59.000,02 BROOKLYN,HEAT/HOT WATER,2019-01-31T21:18:40.000,,,...,-73.971047,PHONE,BROOKLYN,Unspecified,2019-02-06T02:08:59.000,The Department of Housing Preservation and Dev...,,Closed,CARLTON AVENUE,,,41631750,,992280.0,188524.0
2,ADDRESS,HPD,Department of Housing Preservation and Develop...,3.014060e+09,BROOKLYN,,,,BROOKLYN,2019-02-06T09:19:57.000,09 BROOKLYN,PAINT/PLASTER,2019-01-21T01:16:38.000,,,...,-73.937300,PHONE,BROOKLYN,Unspecified,2019-02-06T09:19:57.000,The Department of Housing Preservation and Dev...,,Closed,PRESIDENT STREET,,,41631762,,1001644.0,182410.0
3,ADDRESS,HPD,Department of Housing Preservation and Develop...,3.019570e+09,BROOKLYN,,,,BROOKLYN,2019-02-06T02:08:29.000,02 BROOKLYN,HEAT/HOT WATER,2019-02-02T17:00:04.000,,,...,-73.970678,MOBILE,BROOKLYN,Unspecified,2019-02-06T02:08:29.000,The Department of Housing Preservation and Dev...,,Closed,GREENE AVENUE,,,41631775,,992382.0,189258.0
4,ADDRESS,DOB,Department of Buildings,2.047640e+09,BRONX,,,,BRONX,2019-02-08T00:00:00.000,12 BRONX,General Construction/Plumbing,2019-02-06T15:04:33.000,,,...,-73.842767,UNKNOWN,BRONX,Unspecified,2019-02-08T00:00:00.000,The Department of Buildings investigated this ...,,Closed,MICKLE AVENUE,,,41631774,,1027735.0,257091.0
5,ADDRESS,HPD,Department of Housing Preservation and Develop...,1.016220e+09,MANHATTAN,,,,NEW YORK,2019-02-06T07:23:07.000,11 MANHATTAN,HEAT/HOT WATER,2019-02-03T09:42:16.000,,,...,-73.943279,PHONE,MANHATTAN,Unspecified,2019-02-06T07:23:07.000,The Department of Housing Preservation and Dev...,,Closed,PARK AVENUE,,,41631755,,999954.0,230778.0
6,ADDRESS,HPD,Department of Housing Preservation and Develop...,1.010410e+09,MANHATTAN,,,,NEW YORK,2019-02-06T15:46:26.000,04 MANHATTAN,APPLIANCE,2019-01-25T11:38:48.000,,,...,-73.987809,PHONE,MANHATTAN,Unspecified,2019-02-06T15:46:26.000,The Department of Housing Preservation and Dev...,,Closed,WEST 50 STREET,,,41631763,,987627.0,217264.0
7,ADDRESS,HPD,Department of Housing Preservation and Develop...,4.158100e+09,QUEENS,,,,Far Rockaway,2019-02-06T21:40:19.000,14 QUEENS,HEAT/HOT WATER,2019-02-04T11:52:37.000,,,...,-73.753206,PHONE,QUEENS,Unspecified,2019-02-06T21:40:19.000,The Department of Housing Preservation and Dev...,,Closed,BEACH 19 STREET,,,41631788,,1052790.0,155721.0
8,ADDRESS,HPD,Department of Housing Preservation and Develop...,3.012670e+09,BROOKLYN,,,,BROOKLYN,2019-02-06T21:31:35.000,09 BROOKLYN,HEAT/HOT WATER,2019-02-05T06:34:10.000,,,...,-73.954352,ONLINE,BROOKLYN,Unspecified,2019-02-06T21:31:35.000,The Department of Housing Preservation and Dev...,,Closed,EASTERN PARKWAY,,,41631778,,996913.0,183402.0
9,ADDRESS,HPD,Department of Housing Preservation and Develop...,3.008130e+09,BROOKLYN,,,,BROOKLYN,2019-02-06T21:50:46.000,07 BROOKLYN,ELECTRIC,2019-02-05T15:11:55.000,,,...,-74.017888,PHONE,BROOKLYN,Unspecified,2019-02-06T21:50:46.000,The Department of Housing Preservation and Dev...,,Closed,53 STREET,,,41631779,,979286.0,174967.0


In [120]:
# Do a count query to see how many records there are for 2017
PARAMS = {"$where": "date_extract_y(created_date)=2017",
          "$select": "count(created_date)"}
resp = requests.get(api_URL, headers=HEADERS, params=PARAMS)

totalRecords = int(resp.text.split('"')[-2])
print(f"There are {totalRecords} total records for 2017\n")

# Look through column fields to see which data may/not be relevant, get indices.
for idx, col in enumerate(resp_df.columns):
    print(idx, col)

There are 2461208 total records for 2017

0 address_type
1 agency
2 agency_name
3 bbl
4 borough
5 bridge_highway_direction
6 bridge_highway_name
7 bridge_highway_segment
8 city
9 closed_date
10 community_board
11 complaint_type
12 created_date
13 cross_street_1
14 cross_street_2
15 descriptor
16 due_date
17 facility_type
18 incident_address
19 incident_zip
20 intersection_street_1
21 intersection_street_2
22 landmark
23 latitude
24 location
25 location_address
26 location_city
27 location_state
28 location_type
29 location_zip
30 longitude
31 open_data_channel_type
32 park_borough
33 park_facility_name
34 resolution_action_updated_date
35 resolution_description
36 road_ramp
37 status
38 street_name
39 taxi_company_borough
40 taxi_pick_up_location
41 unique_key
42 vehicle_type
43 x_coordinate_state_plane
44 y_coordinate_state_plane
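
The count query above pulls the number out of the response by splitting on quote characters. A slightly more defensive sketch is to parse the response as CSV. This assumes the endpoint returns a single header row followed by a single quoted value; the `sample` text and the `parse_count` helper are my own illustrations, not part of the original notebook.

```python
import csv
from io import StringIO


def parse_count(csv_text):
    """Return the integer from a one-column, one-row CSV count response."""
    rows = list(csv.reader(StringIO(csv_text)))
    # Expect exactly a header row plus one data row holding one value
    if len(rows) < 2 or len(rows[1]) != 1:
        raise ValueError(f"unexpected count response: {csv_text!r}")
    return int(rows[1][0])


# Assumed shape of the response body (hypothetical sample)
sample = '"count_created_date"\n"2461208"\n'
print(parse_count(sample))  # 2461208
```

In the notebook this would be called as `parse_count(resp.text)`; it fails loudly if the response is not shaped as expected instead of raising an opaque `IndexError`.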


### Sources

There are 2 options for obtaining this NYC 311 data:
1. Download the full dataset (CSV file)
2. Use the Socrata API to query for the data

The full dataset is prohibitively large for my working environment, so I'm choosing the API route. I'm limiting each request to 50,000 records and paging through the data. That is the maximum for v2.0 of this API; I *believe* this endpoint is v2.1 and has no such limit, but I'll keep the cap to be safe. A count query shows that there are ~2.46M records for 2017, so I'll need to be a little structured about how I make the query.
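
As a quick sanity check on the paging plan, the number of requests and the size of the last page follow directly from the record count and the page size. The helper name `page_offsets` is hypothetical; the two numbers come from the count query and the limit described above.

```python
import math


def page_offsets(total_records, page_size):
    """Yield the $offset value for each request needed to cover total_records."""
    num_pages = math.ceil(total_records / page_size)
    for page in range(num_pages):
        yield page * page_size


offsets = list(page_offsets(2461208, 50000))
print(len(offsets))             # 50 requests in total
print(offsets[-1])              # 2450000, offset of the final page
print(2461208 - offsets[-1])    # 11208 rows on that final, partial page
```

The final page being shorter than the page size is what a "keep going while the page is full" loop relies on to stop.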

Most requests finish in a few seconds, but the server does seem to slow down (~30s per request) every so often. When it doesn't, the whole process typically takes about 5 minutes. I currently only kill the process on an unsuccessful response code, but more handling should be implemented for other codes and timeouts. The complete dataframe gets stored to a pickle file to avoid repeating these API requests on later runs.
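
One possible shape for that extra handling is a small retry wrapper with a timeout. This is only a sketch: `fetch_with_retry` and all of its parameters are hypothetical names, and it takes the request function as an argument (e.g. `requests.get`) rather than importing it. The default `retry_on=(OSError,)` relies on my understanding that `requests` exceptions derive from `OSError`; treat that as an assumption to verify.

```python
import time


def fetch_with_retry(get, url, max_tries=3, timeout=60, backoff=2.0,
                     retry_on=(OSError,), sleep=time.sleep, **kwargs):
    """Call get(url, timeout=..., **kwargs) until it returns HTTP 200 or tries run out."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            resp = get(url, timeout=timeout, **kwargs)
            if resp.status_code == 200:
                return resp
            last_error = RuntimeError(f"HTTP {resp.status_code}")
        except retry_on as exc:  # timeouts and connection errors
            last_error = exc
        if attempt < max_tries:
            sleep(backoff * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"giving up after {max_tries} tries") from last_error
```

In the paging loop, this could stand in for the direct request, for example `resp = fetch_with_retry(requests.get, api_URL, headers=HEADERS, params=PARAMS)`.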

I initially downloaded the full records so I could have more data to cross-reference for cleaning purposes (i.e. fill in missing zip code and borough fields). Currently when I convert to a dataframe, I throw most of this data away, but this could be used if a more thorough cleaning were undertaken.

Most columns have data of mixed type, so I opted to let Pandas assign a dtype of object/string. I do set the dtype as a string for the zip code since I don't need to do any numeric processing on it and several entries have non-numeric characters.
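
To illustrate why the zip code is read as a string, here is a small sketch on made-up rows: left to its own inference, Pandas parses the column as integers and drops leading zeros, while `dtype=str` keeps the text exactly as it appears.

```python
from io import StringIO

import pandas as pd

# Two made-up rows; the first zip code starts with a zero
raw = "incident_zip,borough\n08063,Unspecified\n10001,MANHATTAN\n"

as_number = pd.read_csv(StringIO(raw))
as_string = pd.read_csv(StringIO(raw), dtype={'incident_zip': str})

print(as_number['incident_zip'].tolist())  # [8063, 10001]
print(as_string['incident_zip'].tolist())  # ['08063', '10001']
```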

In [123]:
def buildNYCData(num_rows):
    df_list = list()
    
    num_offset = 0
    count = num_rows
    
    while count == num_rows:
        PARAMS = {"$where": "date_extract_y(created_date)=2017",
                  "$order": ":id", # Use Socrata default ordering for paging
                  "$limit": str(num_rows),
                  "$offset": str(num_offset)}

        start = time.time()
        resp = requests.get(api_URL, headers=HEADERS, params=PARAMS)
        printTimeElapsed(start)
        
        print("API Response: {}".format(resp.status_code))
        if resp.status_code != 200:
            print('API request failed')
            break

        start = time.time()
        csv_IO = StringIO(resp.text)
        cols = ['created_date', 'complaint_type', 'incident_zip', 'borough', 'city']
        # low_memory=False is intentionally left off to keep the dtype warnings
        df_list.append(pd.read_csv(csv_IO, dtype={'incident_zip': str}, usecols=cols)[cols])

        count = df_list[-1].shape[0]
        num_offset += count
        printTimeElapsed(start)
        
        print('Records = {}\n'.format(num_offset))
        
    return pd.concat(df_list, ignore_index=True)

In [124]:
num_rows = 50000

# Don't query the data for repeated runs
if os.path.isfile('nyc1_df.pkl'):
    nyc1_df = pd.read_pickle('nyc1_df.pkl')
else:
    nyc1_df = buildNYCData(num_rows)

00:01
API Response: 200
00:00
Records = 50000

00:01
API Response: 200
00:00
Records = 100000

00:02
API Response: 200
00:00
Records = 150000

00:02
API Response: 200
00:00
Records = 200000

00:04
API Response: 200
00:00
Records = 250000

00:02
API Response: 200
00:00
Records = 300000

00:05
API Response: 200
00:00
Records = 350000

00:01
API Response: 200
00:00
Records = 400000

00:33
API Response: 200
00:00
Records = 450000

00:34
API Response: 200
00:00
Records = 500000

00:34
API Response: 200
00:00
Records = 550000

00:36
API Response: 200
00:00
Records = 600000

00:37
API Response: 200
00:00
Records = 650000

00:34
API Response: 200
00:00
Records = 700000

00:39
API Response: 200
00:00
Records = 750000

00:36
API Response: 200
00:00
Records = 800000

00:37
API Response: 200
00:00
Records = 850000

00:36
API Response: 200
00:00
Records = 900000

00:38
API Response: 200
00:00
Records = 950000

00:40
API Response: 200
00:00
Records = 1000000

00:38
API Response: 200
00:00
Records = 

...


00:00
Records = 2450000

00:38
API Response: 200
00:00
Records = 2461208



In [125]:
# Store in pickle file locally before cleanup
nyc1_df.to_pickle('nyc1_df.pkl')

Now to inspect the data and look for missing fields.

### complaint_type

As seen in the count list below, there seem to be duplicates (e.g. `'UNSANITARY CONDITION'` and `'Sanitation Condition'`); however, the online 311 submission form has categories and *subcategories*, many of which seem to change over time or be user-submitted. These records seem to contain only the subcategory, so without more context I could end up mixing complaints from entirely separate categories if I tried to combine them.

I will, however, convert these strings to a uniform case so that entries differing only in capitalization are counted together. This picks up some extra `'PLUMBING'` complaints, for example (49,969 before vs. 56,367 after), but that category still falls outside the top 10.
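
A toy illustration of the effect, using a handful of made-up entries rather than the real column: labels that differ only by case collapse into a single category after `str.upper()`.

```python
import pandas as pd

# Made-up sample with mixed-case variants of the same labels
complaints = pd.Series(['PLUMBING', 'Plumbing', 'Noise', 'NOISE', 'Rodent'])

before = complaints.value_counts()
after = complaints.str.upper().value_counts()

print(len(before), len(after))  # 5 3
print(after['PLUMBING'])        # 2
```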

In [126]:
nyc1_df['complaint_type'].value_counts(dropna=False).head(20)

Noise - Residential                    230152
HEAT/HOT WATER                         213521
Illegal Parking                        146122
Blocked Driveway                       136097
Street Condition                        93265
Street Light Condition                  84195
UNSANITARY CONDITION                    79282
Noise - Street/Sidewalk                 73085
Water System                            65100
Noise                                   60171
PAINT/PLASTER                           57076
PLUMBING                                49969
Noise - Commercial                      47394
Request Large Bulky Item Collection     46614
General Construction/Plumbing           43013
Sanitation Condition                    38937
Missed Collection (All Materials)       36260
Traffic Signal Condition                36178
Dirty Conditions                        35887
Rodent                                  35075
Name: complaint_type, dtype: int64

In [127]:
# Look at count list again after str.upper()
nyc1_df['complaint_type'] = nyc1_df['complaint_type'].str.upper()
nyc1_df['complaint_type'].value_counts(dropna=False).head(20)

NOISE - RESIDENTIAL                    230152
HEAT/HOT WATER                         213521
ILLEGAL PARKING                        146122
BLOCKED DRIVEWAY                       136097
STREET CONDITION                        93265
STREET LIGHT CONDITION                  84195
UNSANITARY CONDITION                    79282
NOISE - STREET/SIDEWALK                 73085
WATER SYSTEM                            65100
NOISE                                   60171
PAINT/PLASTER                           57076
PLUMBING                                56367
NOISE - COMMERCIAL                      47394
REQUEST LARGE BULKY ITEM COLLECTION     46614
GENERAL CONSTRUCTION/PLUMBING           43013
SANITATION CONDITION                    38937
MISSED COLLECTION (ALL MATERIALS)       36260
TRAFFIC SIGNAL CONDITION                36178
DIRTY CONDITIONS                        35887
RODENT                                  35075
Name: complaint_type, dtype: int64

### borough

There are "unspecified" boroughs that could likely be inferred from **incident_zip**, and that will be done later on with the aid of the NYC borough/zip code database.

Otherwise, there are *many* fields that could be used to fill in the blank (city, cross streets, intersection streets, park borough, etc.). Without more context from NYC 311's info page, it's unclear how to best correlate this data.

In [128]:
nyc1_df['borough'].value_counts(dropna=False)

BROOKLYN         771324
QUEENS           589983
MANHATTAN        480351
BRONX            450933
STATEN ISLAND    127141
Unspecified       41476
Name: borough, dtype: int64

In [129]:
bor_missing = len(nyc1_df[nyc1_df['borough'] == 'Unspecified'])
print('{} of {} borough records are missing. This represents {:.3f}% of our data.'
      .format(bor_missing, totalRecords, float(100*bor_missing)/totalRecords))

zc_without_bor = len(nyc1_df[(nyc1_df['incident_zip'].notna()) & (nyc1_df['borough'] == 'Unspecified')])
print('{} of these {} missing borough entries could be inferred from an available zip code.'
      .format(zc_without_bor, bor_missing))
print('This could correct up to {:.3f}% of our missing borough data'
      .format(100*float(zc_without_bor)/bor_missing))

41476 of 2461208 borough records are missing. This represents 1.685% of our data.
4440 of these 41476 missing borough entries could be inferred from an available zip code.
This could correct up to 10.705% of our missing borough data


### incident_zip

There are many missing and irregular entries for this field. I'm not sure there's a clear answer to filling in this data.

Cleanup will involve making sure all zip codes are numeric and exactly 5 characters long.

In [130]:
nyc1_df['incident_zip'].value_counts(dropna=False)

NaN           117792
11226          42131
11385          35089
10467          34086
11207          30779
10453          29706
10458          28942
10452          28545
10468          28525
11208          28394
11221          26301
10456          26214
10457          25979
10031          25836
11206          25112
11209          24842
11225          24833
11234          24394
11213          24316
11233          23652
11203          23409
10472          23407
11212          23347
11216          23238
11230          21830
11220          21711
10314          21710
11215          21604
11238          21371
10032          20982
               ...  
11771              1
98036              1
31600              1
34134              1
43226              1
60179              1
08063              1
19007              1
110044             1
01583              1
10538              1
06840              1
12525              1
10426              1
89117              1
11735-9100         1
13202        

In [131]:
zip_missing = len(nyc1_df[nyc1_df['incident_zip'].isna()])
print('{} of {} zip code records are missing. This represents {:.3f}% of our data.'
      .format(zip_missing, totalRecords, float(100*zip_missing)/totalRecords))

117792 of 2461208 zip code records are missing. This represents 4.786% of our data.


There are several ways zip code fields can be corrupted:
- A 4-digit extension may be appended (e.g. 10001-1234)
- Too short or too long
- Non-numeric characters

I'll fix some of these and set the rest to np.nan.

In [132]:
# View all entries with dashed zip codes
nyc1_df[nyc1_df['incident_zip'].str.contains('-', na=False)]

Unnamed: 0,created_date,complaint_type,incident_zip,borough,city
13711,2017-01-03T12:24:17.000,CONSUMER COMPLAINT,90054-0807,Unspecified,LOS ANGELOUS
151419,2017-01-24T16:52:07.000,CONSUMER COMPLAINT,11802-9060,Unspecified,HICKSVILLE
279796,2017-02-13T06:29:56.000,TAXI REPORT,11973-5000,Unspecified,NEW YORK
286215,2017-02-14T17:07:34.000,CONSUMER COMPLAINT,02298-1002,Unspecified,BOSTON
336941,2017-02-21T15:36:34.000,CONSUMER COMPLAINT,07660-2112,Unspecified,RIDGEFIELD PARK
378836,2017-03-01T12:51:01.000,CONSUMER COMPLAINT,12550-0831,Unspecified,NEWBURGH
423811,2017-03-08T15:48:57.000,CONSUMER COMPLAINT,85285-7288,Unspecified,TEMPE
547246,2017-03-28T09:05:55.000,CONSUMER COMPLAINT,19714-7526,Unspecified,NEWARK
556493,2017-03-23T18:13:20.000,CONSUMER COMPLAINT,11802-9060,Unspecified,HICKSVILLE
639800,2017-04-13T16:14:28.000,CONSUMER COMPLAINT,32255-1268,Unspecified,JACKSONVILLE


This section will iterate through the zip code column looking for a '-'. If one is found, only the part before the dash (the first 5 digits) will be preserved.

In [133]:
for idx, val in enumerate(nyc1_df['incident_zip']):
    if '-' in str(val):
        #print(val.split('-'))
        nyc1_df.loc[idx, 'incident_zip'] = val.split('-')[0]

Iterate again, looking for strings from which a useful zip code likely can't be extracted. Any entries that have non-numeric characters or are the wrong length will be removed (i.e. set to NaN).

In [134]:
bad_val_count = 0
for idx, val in enumerate(nyc1_df['incident_zip']):
    if str(val) == 'nan':
        continue
    elif not str(val).isnumeric():
        bad_val_count += 1
        nyc1_df.loc[idx, 'incident_zip'] = np.nan
    elif len(str(val)) != 5:
        bad_val_count += 1
        nyc1_df.loc[idx, 'incident_zip'] = np.nan
        
print(f'{bad_val_count} values were identified')

60 values were identified
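
The two row-by-row passes above could also be written with vectorized string methods, which tends to be much faster on a few million rows. This is a sketch on made-up values, not what the notebook ran; `clean_zip_codes` is a hypothetical helper, and it accepts only ASCII digits, which is slightly stricter than `str.isnumeric()`.

```python
import numpy as np
import pandas as pd


def clean_zip_codes(zips):
    """Keep the part before any dash, then blank out anything that isn't 5 digits."""
    first_part = zips.str.split('-').str[0]
    is_valid = first_part.str.fullmatch(r'[0-9]{5}', na=False)
    return first_part.where(is_valid, np.nan)


# Made-up examples of each kind of corruption, plus a missing value
sample = pd.Series(['10001-1234', '110044', 'ABC12', None, '10458', '08063'],
                   dtype=object)
cleaned = clean_zip_codes(sample)
print(cleaned.isna().tolist())  # [False, True, True, True, False, False]
```

In the notebook, the equivalent call would be `nyc1_df['incident_zip'] = clean_zip_codes(nyc1_df['incident_zip'])`.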


## 2010 Census Population Data

This dataset is available from the Splitwise blog and contains US zip codes and their populations from the 2010 Census.

In [135]:
census_URL = "https://s3.amazonaws.com/SplitwiseBlogJB/2010+Census+Population+By+Zipcode+(ZCTA).csv"
census_df = pd.read_csv(census_URL, dtype={'Zip Code ZCTA': str})
census_df.head(10)

Unnamed: 0,Zip Code ZCTA,2010 Census Population
0,1001,16769
1,1002,29049
2,1003,10372
3,1005,5079
4,1007,14649
5,1008,1263
6,1009,741
7,1010,3609
8,1011,1370
9,1012,661


In [136]:
# Some zip codes are listed twice in this dataset. I checked all zips starting with '100', '10', and '11' and found no duplicates among them.
census_df[census_df['Zip Code ZCTA'].str.startswith('11')]['Zip Code ZCTA'].value_counts();
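
Instead of scanning the `value_counts()` output by eye, the duplicate check can be made explicit with `duplicated`. The frame below is a made-up stand-in for `census_df` with placeholder populations, purely to show the pattern.

```python
import pandas as pd

# Toy stand-in for census_df; the population values are placeholders
census = pd.DataFrame({
    'Zip Code ZCTA': ['10001', '11201', '11201', '11385'],
    '2010 Census Population': [100, 200, 200, 300],
})

in_range = census[census['Zip Code ZCTA'].str.startswith('11')]
dupes = in_range[in_range['Zip Code ZCTA'].duplicated(keep=False)]

print(len(dupes))  # 2 (both rows of the repeated zip code)
```

An empty `dupes` frame for each prefix of interest would confirm there are no repeats in that range.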

In [137]:
# Write to file
census_df.to_pickle('census_df.pkl')

## Zip codes for NYC boroughs

This dataset will be needed to match zip codes from the Census dataset with their NYC boroughs. It is read from an HTML table on an NYC enthusiast site, and the online formatting will require some transformation here.

In [138]:
zips_URL = "https://www.nycbynatives.com/nyc_info/new_york_city_zip_codes.php"
zips_df = pd.read_html(zips_URL, match='10001')[0]
print('{} rows x {} columns'.format(zips_df.shape[0], zips_df.shape[1]))
zips_df.head(20)

240 rows x 5 columns


Unnamed: 0,0,1,2,3,4
0,10001,Manhattan,,10451,Bronx
1,10002,Manhattan,,10452,Bronx
2,10003,Manhattan,,10453,Bronx
3,10004,Manhattan,,10454,Bronx
4,10005,Manhattan,,10455,Bronx
5,10006,Manhattan,,10456,Bronx
6,10007,Manhattan,,10457,Bronx
7,10009,Manhattan,,10458,Bronx
8,10010,Manhattan,,10459,Bronx
9,10011,Manhattan,,10460,Bronx


In [139]:
# The table is displayed twice in different order. Only need one side.
zips_df.drop(labels=[2,3,4], axis='columns', inplace=True)
print('{} rows x {} columns'.format(zips_df.shape[0], zips_df.shape[1]))
zips_df.head(20)

240 rows x 2 columns


Unnamed: 0,0,1
0,10001,Manhattan
1,10002,Manhattan
2,10003,Manhattan
3,10004,Manhattan
4,10005,Manhattan
5,10006,Manhattan
6,10007,Manhattan
7,10009,Manhattan
8,10010,Manhattan
9,10011,Manhattan


In [140]:
# Rename columns and check borough values
zips_df.columns = ['zip_code', 'borough']  # set_axis(..., inplace=True) is not available in newer pandas
zips_df['borough'].value_counts()

Manhattan    96
Queens       63
Brooklyn     43
Bronx        25
Staten       13
Name: borough, dtype: int64

In [141]:
# Will convert to uppercase and fix 'Staten' for consistency with previous datasets
zips_df['borough'] = zips_df['borough'].str.upper()
zips_df.loc[zips_df['borough'] == 'STATEN', 'borough'] = 'STATEN ISLAND'

# Convert zip code dtype to string
zips_df['zip_code'] = zips_df['zip_code'].astype(str)

In [142]:
#  Write to file
zips_df.to_pickle('zips_df.pkl')

## Revisit 311 NYC Data for cross correlation

Some records in the 311 NYC database have a zip code but an 'Unspecified' borough. These zip codes can easily be used to find the matching borough using the previous dataset (zip codes for NYC boroughs). This will be done by merging the two databases and reassigning *only* the 'Unspecified' fields if there is a zip code match.
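
Here is a minimal sketch of that merge-and-fill idea on made-up rows. The frames `complaints` and `zip_lookup` are toy stand-ins, and `validate='many_to_one'` is an extra guard I've added (not in the original) so that a duplicated zip code in the lookup table raises an error instead of silently duplicating complaint rows.

```python
import pandas as pd

# Toy stand-ins for the 311 data and the zip-to-borough lookup
complaints = pd.DataFrame({
    'incident_zip': ['10001', '10451', None, '10301'],
    'borough': ['Unspecified', 'BRONX', 'Unspecified', 'Unspecified'],
})
zip_lookup = pd.DataFrame({
    'zip_code': ['10001', '10451', '10301'],
    'nycborough': ['MANHATTAN', 'BRONX', 'STATEN ISLAND'],
})

merged = complaints.merge(zip_lookup, left_on='incident_zip', right_on='zip_code',
                          how='left', validate='many_to_one')

# Only touch rows that are 'Unspecified' AND found a match in the lookup
fillable = (merged['borough'] == 'Unspecified') & merged['nycborough'].notna()
merged.loc[fillable, 'borough'] = merged.loc[fillable, 'nycborough']

result = merged.drop(columns=['zip_code', 'nycborough'])
print(result['borough'].tolist())
# ['MANHATTAN', 'BRONX', 'Unspecified', 'STATEN ISLAND']
```

The row with no zip code stays `'Unspecified'`, and the row count is unchanged by the merge.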

Another option, which I also tried out (not shown), is to assume the NYC 311 zip code field *takes precedence* over the borough if there is a mismatch. In this case, I found that several boroughs were overreported, namely Bronx and Manhattan. This may be a valid option, but I'll opt to preserve the original database values for this project.
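
For comparison, the "zip code takes precedence" variant might look like the following on an already-merged toy frame. The column `nycborough` stands for the borough looked up from the zip code, and all rows are made up. Counting mismatches first gives a sense of how many reported boroughs would be overwritten.

```python
import pandas as pd

# Toy merged frame: reported borough vs. borough implied by the zip code
merged = pd.DataFrame({
    'borough': ['BROOKLYN', 'QUEENS', 'Unspecified', 'BRONX'],
    'nycborough': ['BROOKLYN', 'BROOKLYN', 'MANHATTAN', None],
})

# Rows where both values exist but disagree
has_both = (merged['borough'] != 'Unspecified') & merged['nycborough'].notna()
mismatch = has_both & (merged['borough'] != merged['nycborough'])
print(int(mismatch.sum()))  # 1

# Zip-first policy: use the lookup when available, else keep the reported value
zip_first = merged['nycborough'].fillna(merged['borough'])
print(zip_first.tolist())   # ['BROOKLYN', 'BROOKLYN', 'MANHATTAN', 'BRONX']
```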

In [143]:
# Temporarily rename column so there is no conflict in the merge
zips_df = zips_df.rename(columns={'borough': 'nycborough'})

m_df = nyc1_df.merge(zips_df, left_on='incident_zip', right_on='zip_code', how='left')

m_df.loc[(m_df['incident_zip'] == m_df['zip_code']) & \
         (m_df['borough'] == 'Unspecified'), 'borough'] = m_df['nycborough']

del m_df['zip_code']
del m_df['nycborough']

print('BEFORE')
print(nyc1_df['borough'].value_counts(), '\n')

print('AFTER')
print(m_df['borough'].value_counts(), '\n')


print("The original data had {} missing borough fields, {} of which had an accompanying zip code."
      .format(bor_missing, zc_without_bor))

final_bor_missing = len(m_df[m_df['borough'] == 'Unspecified'])
print("The corrected data has {} missing borough fields, an improvement of {} fields."
      .format(final_bor_missing, bor_missing-final_bor_missing))

nyc1_df = m_df

BEFORE
BROOKLYN         771324
QUEENS           589983
MANHATTAN        480351
BRONX            450933
STATEN ISLAND    127141
Unspecified       41476
Name: borough, dtype: int64 

AFTER
BROOKLYN         771659
QUEENS           590318
MANHATTAN        480606
BRONX            451039
STATEN ISLAND    129485
Unspecified       38101
Name: borough, dtype: int64 

The original data had 41476 missing borough fields, 4440 of which had an accompanying zip code.
The corrected data has 38101 missing borough fields, an improvement of 3375 fields.


In [144]:
# Write cleaned up NYC dataframe to separate file 
nyc1_df.to_pickle('nyc2_df.pkl')