# NYC 311 Complaints 2017

For more information on this dataset, see the following articles:
* The _Wired_ magazine article ["What a Hundred Million Calls to 311 Reveal About New York"](https://www.wired.com/2010/11/ff_311_new_york/).
* Ariel White and Kris-Stella Trump's research paper ["The Promises and Pitfalls of 311 Data"](https://arwhite.mit.edu/sites/default/files/images/White%20Trump%20-%20Promises%20Pitfalls%20311%20Data%20-%20UAR%202017.pdf) in [_Urban Affairs Review_](https://journals.sagepub.com/doi/abs/10.1177/1078087416673202).

I used the White and Trump paper, in particular, when making data-cleaning assumptions (e.g., what to clean vs. what to remove) for this project.

In [2]:
import os
from time import time
import requests
from bs4 import BeautifulSoup
from requests import HTTPError
import numpy as np
import pandas as pd
from sodapy import Socrata

## Data Sourcing

The following data sources were used:

* [NYC 311 Complaints](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)
* [2010 Census Population by ZIP Code](https://blog.splitwise.com/2013/09/18/the-2010-us-census-population-by-zip-code-totally-free/)
* [NYC Neighborhood ZIP Codes](https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm)

Inspection of the data revealed non-NYC cities and ZIP codes. Thus, as the focus of this project is on NYC-related complaints, complaints associated with non-NYC locations are removed from the dataset.

### 311 Complaints

In [5]:
!pip install pyarrow
complaints = pd.read_parquet('data/nyc-311-complaints.parquet.gzip')

Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/36/94/23135312f97b20d6457294606fb70fad43ef93b7bffe567088ebe3623703/pyarrow-0.11.1-cp36-cp36m-manylinux1_x86_64.whl (11.6MB)
Collecting numpy>=1.14 (from pyarrow)
  Downloading https://files.pythonhosted.org/packages/7b/74/54c5f9bb9bd4dae27a61ec1b39076a39d359b3fb7ba15da79ef23858a9d8/numpy-1.16.0-cp36-cp36m-manylinux1_x86_64.whl (17.3MB)

RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support

#### Programmatically Sourcing the Data

The [`sodapy`](https://github.com/xmunoz/sodapy) package is used to interface with the [Socrata Open Data (SODA) API](https://dev.socrata.com/) and return the raw 311 complaints data in JSON format. Let's define two (perhaps unnecessary) utility functions: one to format the fields to use in the SoQL query, and another to safely pull the results.


In [220]:
def format_fields(*fields):
    """ Formats arbitrary number of fields for the SoQL query """
    return ','.join(fields)


def get_complaints_data(*fields, year=2017, limit=20000000):
    """ Pull certain `fields` for a given `year` from NYC Open Data's
    SODA API """
    try:
        with Socrata('data.cityofnewyork.us', None) as client:
            results = client.get(
                'fhrw-4uyv',
                content_type='json',
                select=format_fields(*fields),
                where=f"created_date between '{year}-01-01T00:00:00' and '{year}-12-31T23:59:59'",
                limit=limit
            )
        return results
    except HTTPError as e:
        print(e)
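
One subtlety in the `where` clause above: `between` with an upper bound of `23:59:59` could skip a record stamped in the final fractional second of the year. A minimal sketch of a half-open alternative follows. Note that `year_filter` is a hypothetical helper (it is not part of `sodapy`), and it assumes `created_date` accepts the same timestamp literals used in the query above.

```python
def year_filter(year, field='created_date'):
    """ Build a SoQL `where` clause covering the half-open interval
    [Jan 1 of `year`, Jan 1 of `year + 1`) """
    start = f"{year}-01-01T00:00:00"
    end = f"{year + 1}-01-01T00:00:00"
    return f"{field} >= '{start}' and {field} < '{end}'"


print(year_filter(2017))
```

The resulting string could be passed as the `where` argument in place of the `between` expression.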

Now let's use these functions to get our hands on the 311 data for 2017! Note that we're not making a request with an app token, so we'll get a warning that we'll safely ignore for the purposes of this exercise.

In [221]:
r = get_complaints_data(
    'unique_key',
    'created_date',
    'complaint_type',
    'descriptor',
    'borough',
    'city',
    'incident_zip',
    'incident_address',
    'latitude',
    'longitude'
)



In [222]:
complaints = pd.DataFrame.from_records(r)

Converting to datetime takes awhile, so a TODO is to find a faster way to do this.
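
One possible way to speed up the conversion is to call `pd.to_datetime` once on the whole column with an explicit `format`, rather than parsing element by element. The sketch below uses a toy series shaped like the `created_date` strings returned by the API; whether it is fast enough on the full dataset is untested here.

```python
import pandas as pd

# Toy timestamps in the same shape as the API's `created_date` strings
raw = pd.Series([
    '2017-02-20T20:39:10.000',
    '2017-07-06T10:56:50.000',
    '2017-09-09T22:50:12.000',
])

# A single vectorized call with an explicit format avoids per-element
# parsing and format inference
parsed = pd.to_datetime(raw, format='%Y-%m-%dT%H:%M:%S.%f')

print(parsed.dt.year.unique())
```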

In [223]:
complaints.sample(10)

Unnamed: 0,borough,city,complaint_type,created_date,descriptor,incident_address,incident_zip,latitude,longitude,unique_key
300941,BROOKLYN,BROOKLYN,Noise - Residential,2017-02-20T20:39:10.000,Banging/Pounding,111 15 STREET,11215,40.66737453930734,-73.99303574282729,35533959
1119522,BROOKLYN,BROOKLYN,Sidewalk Condition,2017-07-06T10:56:50.000,Sidewalk Violation,1741 EAST 10 STREET,11223,40.606585424593824,-73.96279027454561,36625748
1510923,BRONX,BRONX,Noise - Residential,2017-09-09T22:50:12.000,Loud Music/Party,2609 BRIGGS AVENUE,10458,40.86423852855778,-73.89327458492565,37145345
131938,BROOKLYN,BROOKLYN,Sanitation Condition,2017-01-23T08:45:00.000,12 Dead Animals,780 SACKMAN STREET,11212,40.65822031387889,-73.90197312956957,35310755
1499879,QUEENS,OZONE PARK,Blocked Driveway,2017-09-08T02:32:16.000,Partial Access,107-40 93 STREET,11417,40.67837847603292,-73.84501602403132,37131850
2260346,QUEENS,WOODSIDE,Non-Emergency Police Matter,2017-07-06T08:35:03.000,Other (complaint details),31 AVENUE,11377,,,36628548
1459766,QUEENS,Jamaica,Rodent,2017-09-01T00:00:00.000,Mouse Sighting,138-15 97 AVENUE,11435,40.69562517597251,-73.81097646863442,37079791
1678734,STATEN ISLAND,STATEN ISLAND,Derelict Vehicle,2017-10-06T08:54:14.000,With License Plate,23 BROAD STREET,10304,40.62548605410505,-74.07619851651148,37363702
249526,MANHATTAN,NEW YORK,HEAT/HOT WATER,2017-02-10T07:09:24.000,ENTIRE BUILDING,331 EAST 75 STREET,10021,40.770396401558905,-73.95607391347215,35468485
1035253,BRONX,BRONX,Water System,2017-06-22T15:43:00.000,Hydrant Running Full (WA4),,10452,40.84161911182489,-73.91246938513645,36513272


Note that `city` _often_ holds a neighborhood name, although a borough name is sometimes logged there instead.

In [224]:
complaints = complaints.replace('Unspecified', np.nan)

In [230]:
complaints[complaints['borough'].isnull()].sample(10)

Unnamed: 0,borough,city,complaint_type,created_date,descriptor,incident_address,incident_zip,latitude,longitude,unique_key
2142466,,STATEN ISLAND,Illegal Parking,2017-12-27T21:11:15.000,Commercial Overnight Parking,858 SINCLAIR AVENUE,10309.0,40.53964989611163,-74.20467535164622,38030230
2224763,,,Benefit Card Replacement,2017-04-20T15:49:53.000,Food Stamp,,,,,35987809
2287970,,,HPD Literature Request,2017-09-13T10:37:02.000,Home Ownership Kit,,,,,37174978
2228978,,,HPD Literature Request,2017-05-01T19:45:52.000,The ABCs of Housing - Spanish,,,,,36079962
2228952,,,Benefit Card Replacement,2017-05-01T14:59:55.000,Food Stamp,,,,,36080194
2186146,,,Benefit Card Replacement,2017-01-19T11:57:22.000,Food Stamp,,,,,35284128
2217500,,,DOF Parking - Request Copy,2017-04-03T11:28:26.000,Image of Ticket,,,,,35861479
2290802,,,Benefit Card Replacement,2017-09-19T10:23:24.000,Medicaid,,,,,37228348
2221824,,,Forms,2017-04-13T22:53:03.000,Office of Preventive Technical Assistance/OPTA,,,,,35935579
2345542,,,DOF Parking - Payment Issue,2017-10-30T13:46:19.000,Applied to Wrong Ticket,,,,,37558092


Let's just add a small sanity check to make sure we only have data from 2017...

In [83]:
assert complaints.created_date.min().year == 2017
assert complaints.created_date.max().year == 2017

#### Investigating Missing Values

First, let's look at the percentage of missing values in each column.

In [165]:
complaints.apply(lambda x: round(100 * x.isnull().mean(), 2))

borough              1.69
city                 4.79
complaint_type       0.00
created_date         0.00
descriptor           0.01
incident_address    17.26
incident_zip         4.78
latitude             7.77
longitude            7.77
park_borough         1.69
unique_key           0.00
dtype: float64

In [70]:
complaints.borough.value_counts()

BROOKLYN         771324
QUEENS           589976
MANHATTAN        480335
BRONX            450933
STATEN ISLAND    127137
Unspecified       41482
Name: borough, dtype: int64

In [71]:
complaints.query("borough == 'Unspecified'").isnull().mean()

borough           0.000000
city              0.890362
complaint_type    0.000000
created_date      0.000000
descriptor        0.000169
incident_zip      0.890507
unique_key        0.000000
dtype: float64

Given that we have unspecified boroughs and our analysis is borough-based, for the present analysis we can simply replace the non-borough cities with missing values, fill in any null borough values with any borough values logged in the `city` field, and then remove the now-redundant `city` field.

In [232]:
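# Flag rows whose `city` value is actually a borough name; compare in
# upper case because `city` mixes ALL CAPS and Title Case spellings
is_borough = complaints['city'].str.upper().isin(
    ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND'])
# Blank out the non-borough cities so only borough names are left in `city`
complaints.loc[~is_borough, 'city'] = np.nan
complaints.loc[:, 'borough'] = (
    complaints.loc[:, 'borough']
              .combine_first(complaints['city'].str.upper())
)
complaints.drop('city', axis=1, inplace=True)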
complaints.sample(10)

Unnamed: 0,borough,complaint_type,created_date,descriptor,incident_address,incident_zip,latitude,longitude,unique_key
509068,QUEENS,FLOORING/STAIRS,2017-03-27T12:39:00.000,FLOOR,26-49 96 STREET,11369,40.76174930651036,-73.87307593572716,35807782
1577903,BROOKLYN,Derelict Vehicle,2017-09-20T07:54:15.000,With License Plate,1870 STUART STREET,11229,40.607473397008086,-73.94011862526952,37232906
1278950,BROOKLYN,Sanitation Condition,2017-08-01T11:05:00.000,22 Weeds,7123 10 AVENUE,11228,40.626397263728926,-74.01183412979289,36840947
695392,QUEENS,Sidewalk Condition,2017-04-28T10:15:40.000,Sidewalk Violation,20-35 PARSONS BOULEVARD,11357,40.7812470703231,-73.82253008738705,36056987
177622,QUEENS,Sidewalk Condition,2017-01-30T09:24:53.000,Sidewalk Violation,95-26 78 STREET,11416,40.68260635973578,-73.86076951574849,35373042
1818657,BRONX,HEAT/HOT WATER,2017-10-30T19:47:47.000,ENTIRE BUILDING,2625 GRAND CONCOURSE,10468,40.86515417644589,-73.89508805180566,37556951
2343553,BRONX,DOF Property - Payment Issue,2017-09-12T10:43:55.000,Other Billing Issue,,10462,,,37162674
2331890,MANHATTAN,Maintenance or Facility,2017-07-28T11:53:29.000,Structure - Outdoors,,10023,,,36811161
880060,BRONX,Noise - Residential,2017-05-28T02:21:05.000,Loud Music/Party,WEST 197 STREET,10468,40.872050093519945,-73.89956436195631,36302339
1419185,BROOKLYN,General Construction/Plumbing,2017-08-25T16:10:18.000,Site Conditions Endangering Workers,295 DEGRAW STREET,11231,40.683516644455786,-73.99462050365543,37026756
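
The borough back-fill described above can be sketched on a toy frame to make the behavior of `combine_first` easier to see. The data and the `city_as_borough` name are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'borough': ['BROOKLYN', np.nan, np.nan, np.nan],
    'city': ['BROOKLYN', 'STATEN ISLAND', 'Jamaica', np.nan],
})

boroughs = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']

# Keep `city` only where it names a borough; everything else becomes NaN
city_upper = toy['city'].str.upper()
city_as_borough = city_upper.where(city_upper.isin(boroughs))

# Existing borough values win; gaps are filled from `city_as_borough`
toy['borough'] = toy['borough'].combine_first(city_as_borough)

print(toy['borough'].tolist())
```

The row with `Jamaica` stays null: a neighborhood name alone is not mapped to a borough by this step.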


### Population Data by ZIP Code

Note that this dataset technically contains [ZCTA (ZIP Code Tabulation Area)](https://www.census.gov/geo/reference/zctas.html) codes, not ZIP codes. However, they are treated as equivalent for the purpose of this exercise.

In [177]:
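import urllib.error


def get_population_by_zip(url):
    """ Gets 2010 Census population by ZIP code data """
    try:
        population_by_zip = pd.read_csv(url)
        return population_by_zip
    # pandas fetches URLs with urllib, so a failed download surfaces as
    # urllib.error.HTTPError rather than the requests.HTTPError imported above
    except urllib.error.HTTPError as e:
        print(e)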

In [178]:
population_url = ('https://s3.amazonaws.com/SplitwiseBlogJB/'
                  '2010+Census+Population+By+Zipcode+(ZCTA).csv')

population_by_zip = get_population_by_zip(population_url)

In [186]:
population_by_zip.columns = 'zip_code population'.split(' ')
population_by_zip.head(10)

Unnamed: 0,zip_code,population
0,1001,16769
1,1002,29049
2,1003,10372
3,1005,5079
4,1007,14649
5,1008,1263
6,1009,741
7,1010,3609
8,1011,1370
9,1012,661
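
Because the ZIP codes are parsed as integers, leading zeros are lost (`1001` in the output above is really `01001`). NYC ZIP codes all begin with `1`, so this should not affect the present analysis, but the sketch below shows two ways to keep five-digit codes if they are ever needed. The toy CSV and its header names are assumptions for illustration.

```python
import io
import pandas as pd

# Toy CSV in the same two-column layout as the population file
csv_text = "zip_code,population\n01001,16769\n10453,78309\n"

# Option 1: read the column as text so leading zeros are kept
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zip_code': str})

# Option 2: if the column was already parsed as an integer,
# pad it back to five digits
as_int = pd.read_csv(io.StringIO(csv_text))
restored = as_int['zip_code'].astype(str).str.zfill(5)

print(as_text['zip_code'].tolist())
print(restored.tolist())
```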


### NYC ZIP Code Data

As mentioned above, brief inspection of the 311 complaints dataset revealed cities that are not in NYC. Thus, as part of my preprocessing I filtered out these observations.

Data on the ZIP codes associated with NYC neighborhoods comes from the New York State Department of Health. The HTML table this data is stored in is not in the most convenient format for data analysis, so the data must be scraped and turned into ["tidy"](https://r4ds.had.co.nz/tidy-data.html) format (i.e., basically a statistician's analogue of Codd's [3NF](https://en.wikipedia.org/wiki/Third_normal_form) familiar to data engineers).

First, let's create two utility functions, the first to scrape the table from the Department of Health website, and the second to "tidy" the data into a minimal table of borough by ZIP code data.

In [197]:
def scrape_nyc_zips(url):
    """ Scrapes table of NYC ZIP codes from the New York State
    Department of Health """
    try:
        r = requests.get(url)
        # requests only raises HTTPError when asked to check the status
        r.raise_for_status()
        return r
    except HTTPError as e:
        print("NYC neighborhood ZIP code lookup table not found:", e)


# TODO: Refactor to have utility functions for, for example, the
# "tidying" aspects and the conversion aspects ... and rename this
# function to something more sensible
def tidy_nyc_zips(html):
    """ Wrangle HTML table of NYC ZIP codes into a "tidy" data frame

    Args:
        html (requests.models.Response):

    Returns:
        pandas.DataFrame:
    """

    # TODO: This seems too ugly and hacky so find a more elegant
    # solution
    borough_zips = (
        pd.read_html(html.content, header=0)[0]
          .reset_index()
    )

    borough_zips.loc[borough_zips['ZIP Codes'].isnull(), 'Borough'] = np.nan
    borough_zips.loc[:, 'ZIP Codes'] = \
        borough_zips.loc[:, 'ZIP Codes'].str.replace(' ', '')

    borough_zips.loc[:, 'ZIP Codes'] = (
        borough_zips.loc[:, 'ZIP Codes']
                    .combine_first(borough_zips['Neighborhood'])
    )

    # TODO: keep the neighborhood information, even though it's not
    # currently necessary for this analysis
    borough_zips.drop('Neighborhood', axis=1, inplace=True)
    borough_zips.loc[:, 'Borough'] = \
        borough_zips.loc[:, 'Borough'].ffill()

    # Overwrite the comma-separated string "list" in the cell
    # with an actual list of strings (cast to integers further down)
    borough_zips.loc[:, 'ZIP Codes'] = (
        borough_zips.loc[:, 'ZIP Codes']
                    .apply(lambda x: x.split(','))
    )

    # TODO: Write utility function for this pattern
    borough_zips = (
        borough_zips.set_index(['index', 'Borough'])
                    .loc[:, 'ZIP Codes']
                    .apply(pd.Series) # Expand each list into columns
                    .stack()
                    .reset_index()
    )

    borough_zips.drop(['index', 'level_2'], axis=1, inplace=True)
    borough_zips.columns = \
        'borough zip_code'.split(' ')
    borough_zips.loc[:, 'zip_code'] = \
        borough_zips.loc[:, 'zip_code'].astype(int)

    return borough_zips
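
As a possible answer to the "utility function for this pattern" TODO above, newer versions of `pandas` (0.25 and later) provide `DataFrame.explode`, which turns each element of a list-valued cell into its own row. A minimal sketch on toy data, not a drop-in replacement for `tidy_nyc_zips`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Borough': ['Bronx', 'Queens'],
    'ZIP Codes': ['10453,10457,10460', '11361,11362'],
})

tidy = (
    toy.assign(zip_code=toy['ZIP Codes'].str.split(','))
       .explode('zip_code')            # one row per list element
       .drop(columns='ZIP Codes')
       .rename(columns={'Borough': 'borough'})
       .reset_index(drop=True)
)
tidy['zip_code'] = tidy['zip_code'].astype(int)

print(tidy)
```

This replaces the `set_index` / `apply(pd.Series)` / `stack` / `reset_index` chain with a single call, at the cost of requiring a more recent `pandas`.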

In [198]:
nyc_zips_url = ('https://www.health.ny.gov/statistics/cancer/registry/appendix/'
                'neighborhoods.htm')

html = scrape_nyc_zips(nyc_zips_url)
nyc_zips = tidy_nyc_zips(html)

nyc_zips.head(10)

Unnamed: 0,borough,zip_code
0,Bronx,10453
1,Bronx,10457
2,Bronx,10460
3,Bronx,10458
4,Bronx,10467
5,Bronx,10468
6,Bronx,10451
7,Bronx,10452
8,Bronx,10456
9,Bronx,10454


In [199]:
population_by_zip_nyc = nyc_zips.merge(
    population_by_zip,
    on='zip_code',
    how='left'
)

In [200]:
population_by_zip_nyc.head(10)

Unnamed: 0,borough,zip_code,population
0,Bronx,10453,78309.0
1,Bronx,10457,70496.0
2,Bronx,10460,57311.0
3,Bronx,10458,79492.0
4,Bronx,10467,97060.0
5,Bronx,10468,76103.0
6,Bronx,10451,45713.0
7,Bronx,10452,75371.0
8,Bronx,10456,86547.0
9,Bronx,10454,37337.0


In [201]:
population_by_zip_nyc[population_by_zip_nyc['population'].isnull()]

Unnamed: 0,borough,zip_code,population
140,Queens,11695,


Note that ZIP code 11695 is (according to Google) Far Rockaway.

## Data Cleaning

Given the size of the dataset (roughly 2.5 million observations, per the borough counts above), string values are converted to `pandas` [categorical](https://pandas.pydata.org/pandas-docs/stable/categorical.html) variables, which are internally stored as integers and thus cut down on memory usage when slicing and dicing the data frame.
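
To give a rough sense of the savings, here is a small sketch comparing the same toy column stored as plain strings and as a categorical. The exact byte counts depend on the `pandas` version and string storage, so treat the numbers as indicative only.

```python
import pandas as pd

# Toy column with a few distinct values repeated many times
boroughs = pd.Series(
    ['BROOKLYN', 'QUEENS', 'MANHATTAN', 'BRONX', 'STATEN ISLAND'] * 20000
)

# `deep=True` counts the string contents, not just the references to them
as_strings = boroughs.memory_usage(deep=True)
as_category = boroughs.astype('category').memory_usage(deep=True)

print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```

The categorical version stores each distinct label once plus a small integer code per row, which is why it pays off for low-cardinality columns such as `borough` and `complaint_type`.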

### Data Type Conversion

The main idea here is to cut down on the number of variables stored as text in order to decrease the memory used by `pandas`, which at the start is as follows:

In [251]:
complaints.memory_usage()

Index               18623960
complaint_type       4668006
created_date        18623960
descriptor          18623960
incident_address    18623960
incident_zip        18623960
latitude            18623960
longitude           18623960
unique_key          18623960
zip_code_x          18623960
population_x        18623960
borough             18623960
zip_code_y          18623960
population_y        18623960
dtype: int64

In [233]:
# Assign with `complaints[col] = ...` rather than `.loc[:, col] = ...`,
# since the latter may write into the existing column and keep its old dtype
complaints['unique_key'] = complaints['unique_key'].astype(int)

complaints.loc[complaints['borough'].eq('Unspecified'), 'borough'] = None

complaints['borough'] = complaints['borough'].astype('category')

complaints['complaint_type'] = complaints['complaint_type'].astype('category')

# One vectorized call instead of element-wise `.apply(pd.to_datetime)`
complaints['created_date'] = pd.to_datetime(complaints['created_date'])

In [146]:
complaints.memory_usage()
complaints.dtypes

borough                     object
city                        object
complaint_type              object
created_date        datetime64[ns]
descriptor                  object
incident_address            object
incident_zip                object
latitude                    object
longitude                   object
park_borough                object
unique_key                  object
dtype: object

In [238]:
# Vectorized conversion; values that cannot be parsed become NaN
complaints['incident_zip'] = \
    pd.to_numeric(complaints['incident_zip'], errors='coerce')

### Imputing Values for "Unspecified" Boroughs

Brief exploration of the 311 complaints dataset reveals that borough is missing for many incidents associated with Queens. For example, a value of `Unspecified` shows up for Forest Hills, Hollis, and other neighborhoods in Queens.

In [104]:
is_unspecified = complaints['borough'].eq('Unspecified')
complaints.loc[is_unspecified, 'borough'] = np.nan

In [142]:
complaints[complaints['borough'].isnull() & complaints['city'].isnull()].sample(10)
complaints[complaints['complaint_type'].str.contains('request', case=False)].complaint_type.value_counts()
is_request = complaints['complaint_type'].str.contains('request', case=False)
is_benefit = complaints['complaint_type'].eq('Benefit Card Replacement')
no_requests = complaints[~is_request & ~is_benefit].copy()
# Benefit Card Replacement
# DCA / DOH New License Application Request
# DOF Parking - Payment Issue
# School Maintenance
# Literature Request
# Street Light Condition
# Forms
# Illegal Parking
# DOF Parking - DMV Clearance
# DCA / DOH New License Application Request

In [216]:
no_requests[no_requests['borough'].isnull() & no_requests['city'].isnull()].sample(10)



Unnamed: 0,borough,city,complaint_type,created_date,descriptor,incident_address,incident_zip,latitude,longitude,park_borough,unique_key
2301049,,,Street Light Condition,2017-10-13 09:23:00,Lamppost Damaged,WESTSHORE EXPY,,,,Unspecified,37422118
2316586,,,Street Light Condition,2017-11-30 10:37:00,Street Light Out,,,,,Unspecified,37822131
2208396,,,Traffic Signal Condition,2017-03-10 17:29:00,Controller,,,,,Unspecified,35669403
2183282,,,DOF Parking - Payment Issue,2017-01-10 11:26:11,Finance Business Center - Not Reflected,,,,,Unspecified,35215161
2301450,,,DOF Parking - Payment Issue,2017-10-16 15:59:45,Card - No DOF Confirmation Number Issued,,,,,Unspecified,37437230
2255200,,,Street Light Condition,2017-06-23 07:13:00,Street Light Out,,,,,Unspecified,36520692
2230303,,,DOF Parking - Payment Issue,2017-05-02 12:58:16,Status of PV Refund,,,,,Unspecified,36088465
2292615,,,Street Light Condition,2017-09-22 10:14:00,Street Light Out,,,,,Unspecified,37250105
2211037,,,Street Light Condition,2017-03-17 11:39:00,Street Light Out,,,,,Unspecified,35726835
2235627,,,Noise - Residential,2017-05-14 01:24:51,Loud Music/Party,FOREST AVENUE,,,,Unspecified,36181147


In [133]:
complaints['city'].value_counts()
complaints[complaints['borough'].isnull() & complaints['city'].isnull()].park_borough.value_counts()
complaints[complaints['complaint_type'].str.contains('request', case=False)].complaint_type.value_counts()

BROOKLYN               746593
NEW YORK               462594
BRONX                  433107
STATEN ISLAND          127838
Jamaica                 30973
JAMAICA                 27198
FLUSHING                23087
ASTORIA                 21912
Flushing                21629
Astoria                 17956
RIDGEWOOD               17810
Ridgewood               17255
CORONA                  12155
WOODSIDE                11104
FRESH MEADOWS           10691
Far Rockaway            10625
OZONE PARK              10023
Elmhurst                 9581
EAST ELMHURST            9250
ELMHURST                 9154
LONG ISLAND CITY         9063
Ozone Park               8760
Corona                   8750
Woodside                 8638
FOREST HILLS             8359
SOUTH OZONE PARK         8257
SOUTH RICHMOND HILL      8249
South Ozone Park         8237
QUEENS VILLAGE           7785
Queens Village           7661
                        ...  
BROOKYLN                    1
LANCASTER                   1
BRIELLE   

### (Minimal) Text Standardization and Cleaning

`ALL CAPS` values were changed to `Title Case` for the `borough` and `city` fields.

No further cleanup or standardization was attempted for the `complaint_type` or `descriptor` fields, as the capitalizations and conventions there (e.g., acronyms) often appear meaningful.

In [145]:
complaints.loc[:, 'borough'] = complaints.loc[:, 'borough'].str.title()
complaints.loc[:, 'city'] = complaints.loc[:, 'city'].str.title()
complaints.sample(15)

Unnamed: 0,borough,city,complaint_type,created_date,descriptor,incident_address,incident_zip,latitude,longitude,park_borough,unique_key
463499,Bronx,Bronx,Noise - Vehicle,2017-03-20 18:14:51,Car/Truck Music,1254 SHERMAN AVENUE,10456.0,40.83492955386326,-73.9152101945469,BRONX,35748028
1253573,Bronx,Bronx,Noise - Residential,2017-07-29 00:39:17,Loud Music/Party,891 FOX STREET,10459.0,40.81900434693768,-73.89478145224119,BRONX,36808084
1060305,Bronx,Bronx,DOOR/WINDOW,2017-06-26 14:37:00,DOOR,2110 HONEYWELL AVENUE,10460.0,40.8452892476298,-73.88215477341697,BRONX,36546633
1659996,Brooklyn,Brooklyn,Noise - Commercial,2017-10-03 07:45:59,Banging/Pounding,60 BROADWAY,11249.0,40.71058076963279,-73.9664258153253,BROOKLYN,37338587
1848510,Manhattan,New York,Sewer,2017-11-03 06:15:00,Catch Basin Sunken/Damaged/Raised (SC1),WEST 125 STREET,10027.0,40.81046780475128,-73.95187871318936,MANHATTAN,37601485
1173561,Brooklyn,Brooklyn,Noise - Residential,2017-07-16 01:13:51,Loud Music/Party,,11236.0,40.63842811361248,-73.91274672730377,BROOKLYN,36701878
407023,Brooklyn,Brooklyn,HEAT/HOT WATER,2017-03-12 15:24:21,APARTMENT ONLY,1904 NOSTRAND AVENUE,11226.0,40.639220898441096,-73.94834139005543,BROOKLYN,35676499
345259,Bronx,Bronx,Unsanitary Animal Pvt Property,2017-02-28 00:00:00,Dog,3135 COUNTRY CLUB ROAD,10465.0,40.84251215227546,-73.82433999222908,BRONX,35593623
528547,Brooklyn,Brooklyn,Derelict Vehicles,2017-03-30 19:59:00,14 Derelict Vehicles,885 STERLING PLACE,11216.0,40.67247702745193,-73.9491411685897,BROOKLYN,35836178
2192170,Brooklyn,,Street Condition,2017-01-31 16:36:28,"Rough, Pitted or Cracked Roads",WILLOUGHBY AVENUE,,,,BROOKLYN,35379099


In [154]:
complaints.isnull().mean()

borough_x           0.016854
city                0.047876
complaint_type      0.000000
created_date        0.000000
descriptor          0.000082
incident_address    0.172558
incident_zip        0.047884
latitude            0.077693
longitude           0.077693
park_borough        0.000000
unique_key          0.000000
borough_y           0.054117
zipcode             0.054117
dtype: float64

Now let's get the last of the "Unspecified" boroughs that we can (given the minimal cleanup and data sourcing that we've done).

In [245]:
nrow_start = complaints.shape[0]
complaints = complaints.merge(
    population_by_zip_nyc,
    left_on='incident_zip',
    right_on='zip_code',
    how='left'
)
assert complaints.shape[0] == nrow_start
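
The row-count assertion above guards against the merge duplicating complaints. As a complementary sketch, `merge` also accepts `validate` and `indicator` arguments that can make the same check explicit and show which rows found a match. The toy frames below are made up for illustration.

```python
import pandas as pd

complaints_toy = pd.DataFrame({
    'unique_key': [1, 2, 3],
    'incident_zip': [10453, 10453, 99999],
})
lookup_toy = pd.DataFrame({
    'zip_code': [10453, 10457],
    'borough': ['Bronx', 'Bronx'],
})

merged = complaints_toy.merge(
    lookup_toy,
    left_on='incident_zip',
    right_on='zip_code',
    how='left',
    validate='many_to_one',  # raises MergeError if `zip_code` repeats on the right
    indicator=True,          # adds a `_merge` column showing match status
)

print(merged['_merge'].value_counts())
```

Rows marked `left_only` are complaints whose ZIP code is absent from the lookup table, which is one way to spot the non-NYC observations.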

In [249]:
complaints.loc[:, 'borough'] = (
    complaints.loc[:, 'borough_x']
              .combine_first(complaints['borough_y'])
)
complaints.drop(['borough_x', 'borough_y'], axis=1, inplace=True)

In [250]:
complaints.isnull().mean()

complaint_type      0.000000
created_date        0.000000
descriptor          0.000079
incident_address    0.147200
incident_zip        0.000000
latitude            0.030967
longitude           0.030967
unique_key          0.000000
zip_code_x          0.000000
population_x        0.000011
borough             0.000000
zip_code_y          0.000000
population_y        0.000011
dtype: float64

In [157]:
is_borough = complaints['city'].isin(['Brooklyn', 'Queens', 'Staten Island', 'Manhattan', 'Bronx'])
# Blank out non-borough values in `city` (not a new `complaints` column)
complaints.loc[~is_borough, 'city'] = np.nan

complaints.loc[:, 'borough'] = (
    complaints.loc[:, 'borough']
              .combine_first(complaints['city'])
)

complaints.loc[:, 'borough'] = (
    complaints.loc[:, 'borough_x']
              .combine_first(complaints['borough_y'])
)

### Removing Non-NYC Observations

Given that the questions in this exercise deal with NYC, any complaint associated with a ZIP code outside of the city's boundaries is removed.

<img src="geocode.png" alt="Drawing" style="width: 400px;"/>

In [None]:
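# Column names follow the frames built above: `incident_zip` in the
# complaints data and `zip_code` in the NYC ZIP code lookup table
is_nyc = complaints['incident_zip'].isin(nyc_zips['zip_code'])
complaints = complaints.loc[is_nyc, :]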

There's much more I'd like to do to clean up the dataset and impute missing values ... but you've got to stop somewhere.

### Wrapping Up

The cleaned version of the dataset is then saved for later analysis using the columnar [Apache Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) format, which is significantly faster to read than the corresponding CSV.

In [None]:
# Request gzip compression explicitly so the contents match the file name
complaints.to_parquet('data/nyc-311-complaints.parquet.gzip', compression='gzip')