In [1]:
# First, make sure to install these packages before running:
! pip3 install pandas
! pip3 install sodapy



In [2]:
# Second, import the required packages
import os
import gc
import pandas as pd

from sodapy import Socrata

# Problem Statement

New Yorkers use the 311 system to report non-emergency problems to local authorities. These problems are assigned to various agencies in New York. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, their large volume and sudden increase are affecting the overall efficiency of the agency's operations.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions, and those answers must be supported by data and analytics. These are their questions:

* Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?
* Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaint you identified in response to question 1?
* Does the complaint type that you identified in response to question 1 have an obvious relationship with any particular characteristic or characteristics of the houses or buildings?
* Can a predictive model be built to predict the likelihood of future complaints of the type that you identified in response to question 1?

Your organization has assigned you as the lead data scientist to provide the answers to these questions. You need to work on getting answers to them in this Capstone Project by following the standard approach of data science and machine learning.

# Datasets

You will use two datasets from the Department of Housing Preservation and Development of New York City to address their problems.

## 311 complaint dataset

This dataset is available at https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9. You can download part of this data by using the SODA API.

Download only the data that is related to the Department of Housing Preservation and Development, and restrict it to a limited number of fields. Otherwise, the data will be unnecessarily large and might not fit in the Watson Studio environment. Too much data is also slow to process and analyze.

In [3]:
# The URL for the API endpoint
data_url = 'data.cityofnewyork.us'
# The data set at the API endpoint (311 data in this case)
data_set = 'erm2-nwe9'

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata(data_url, None)

# Set the timeout to 60 seconds
client.timeout = 60

# Example authenticated client (needed for non-public datasets):
# client = Socrata(data_url,
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# Up to the first 10,000,000 results, returned as JSON from the API and
# converted to a Python list of dictionaries by sodapy.
results = client.get(data_set,
                     content_type = "json",
                     select = "created_date, unique_key, complaint_type, \
                               incident_zip, incident_address, street_name, \
                               address_type, city, resolution_description, \
                               borough, latitude, longitude, closed_date, \
                               location_type, status",
                     where = "Agency = 'HPD'",
                     limit = 10000000)
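
A single request with a very large `limit` can time out or exhaust memory. The sketch below shows one possible alternative: fetching the data in pages with the `limit` and `offset` arguments of `client.get`. The helper `iter_pages`, the `fetch_page` callable, and the page size of 50,000 are assumptions introduced for illustration and are not part of the original notebook.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def iter_pages(fetch_page: Callable[[int, int], List[Record]],
               page_size: int = 50000) -> Iterator[Record]:
    """Yield records page by page until the source returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        if not page:
            break
        yield from page
        offset += len(page)


# Hypothetical wiring to sodapy (commented out because it needs network access).
# A stable sort order keeps the pages from overlapping between requests.
# def fetch_page(limit, offset):
#     return client.get("erm2-nwe9", where="agency = 'HPD'",
#                       order="unique_key", limit=limit, offset=offset)
#
# results = list(iter_pages(fetch_page))
```

Passing the fetch function in as a parameter keeps the paging logic independent of the network, so it can be tried out with a stub before running the full download.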



In [4]:
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
results_df.head()

Unnamed: 0,created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,location_type,status,closed_date
0,2020-02-16T23:44:19.000,45631575,HEAT/HOT WATER,10455,511 EAST 148 STREET,EAST 148 STREET,ADDRESS,BRONX,The following complaint conditions are still o...,BRONX,40.814173861738354,-73.91501987953453,RESIDENTIAL BUILDING,Open,
1,2020-02-16T08:58:39.000,45630120,HEAT/HOT WATER,11218,483 OCEAN PARKWAY,OCEAN PARKWAY,ADDRESS,BROOKLYN,The complaint you filed is a duplicate of a co...,BROOKLYN,40.63683035438464,-73.97285471098033,RESIDENTIAL BUILDING,Open,
2,2020-02-16T15:33:26.000,45628166,HEAT/HOT WATER,11230,788 EAST 10 STREET,EAST 10 STREET,ADDRESS,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,40.630019060451055,-73.96719768603026,RESIDENTIAL BUILDING,Closed,2020-02-16T18:07:11.000
3,2020-02-16T20:34:31.000,45628140,HEAT/HOT WATER,10033,495 WEST 186 STREET,WEST 186 STREET,ADDRESS,NEW YORK,The following complaint conditions are still o...,MANHATTAN,40.8511048561927,-73.92844448304969,RESIDENTIAL BUILDING,Open,
4,2020-02-16T19:58:18.000,45632880,HEAT/HOT WATER,10030,111 WEST 141 STREET,WEST 141 STREET,ADDRESS,NEW YORK,The complaint you filed is a duplicate of a co...,MANHATTAN,40.8181522391374,-73.93866094335702,RESIDENTIAL BUILDING,Open,


In [5]:
# Relabel the 'HEATING' complaints as 'HEAT/HOT WATER' complaints:
# according to the data set description, the name was changed in 2014
print('The initial number of \'HEATING\' issues:', (results_df['complaint_type'] == 'HEATING').sum())
results_df.loc[results_df['complaint_type'] == 'HEATING', 'complaint_type'] = 'HEAT/HOT WATER'
print('The number of \'HEATING\' issues after conversion:', (results_df['complaint_type'] == 'HEATING').sum())

The initial number of 'HEATING' issues: 887869
The number of 'HEATING' issues after conversion: 0
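
If more legacy complaint names turn out to need merging, the same step could be expressed with a mapping and `Series.replace`. This is only a sketch on toy data: the helper `merge_legacy_types` and the `LEGACY_NAMES` dictionary are hypothetical, and only the `HEATING` entry comes from the notebook.

```python
import pandas as pd

# Hypothetical mapping of legacy names to current names.
# Only the HEATING entry is taken from the notebook.
LEGACY_NAMES = {"HEATING": "HEAT/HOT WATER"}


def merge_legacy_types(df, column="complaint_type", mapping=LEGACY_NAMES):
    """Return a copy of df with legacy complaint names replaced."""
    out = df.copy()
    out[column] = out[column].replace(mapping)
    return out


# Toy data for illustration only
toy = pd.DataFrame(
    {"complaint_type": ["HEATING", "PLUMBING", "HEAT/HOT WATER"]}
)
merged = merge_legacy_types(toy)
print(merged["complaint_type"].value_counts())
```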


In [6]:
# Convert the date string to the date type
results_df['created_date'] = results_df['created_date'].astype('datetime64[ns]', errors = 'ignore')
results_df['closed_date'] = results_df['closed_date'].astype('datetime64[ns]', errors = 'ignore')

# Convert the unique key to integer
results_df['unique_key'] = results_df['unique_key'].astype('int64', errors = 'ignore')

# Convert the incident zip to float
results_df['incident_zip'] = results_df['incident_zip'].astype('float64', errors = 'ignore')

# Convert latitude and longitude to floats
results_df['latitude'] = results_df['latitude'].astype('float64', errors = 'ignore')
results_df['longitude'] = results_df['longitude'].astype('float64', errors = 'ignore')

display(results_df.dtypes)

created_date              datetime64[ns]
unique_key                         int64
complaint_type                    object
incident_zip                     float64
incident_address                  object
street_name                       object
address_type                      object
city                              object
resolution_description            object
borough                           object
latitude                         float64
longitude                        float64
location_type                     object
status                            object
closed_date               datetime64[ns]
dtype: object
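
With `errors = 'ignore'`, a column that contains even one unparseable value is silently left as `object`. A possible alternative, sketched here on toy data, is to use `pd.to_datetime` and `pd.to_numeric` with `errors="coerce"`, which turns bad values into `NaT`/`NaN` so they can be counted. The invalid values `"not a date"` and `"N/A"` are made up for the demonstration.

```python
import pandas as pd

# Toy rows: the first row mirrors the notebook's sample, the second
# row contains made-up invalid values.
raw = pd.DataFrame({
    "created_date": ["2020-02-16T23:44:19.000", "not a date"],
    "incident_zip": ["10455", "N/A"],
    "latitude": ["40.814173861738354", None],
})

typed = raw.copy()

# Unparseable dates become NaT instead of blocking the conversion
typed["created_date"] = pd.to_datetime(typed["created_date"], errors="coerce")

# Unparseable numbers become NaN
for col in ["incident_zip", "latitude"]:
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

print(typed.dtypes)
print(typed.isna().sum())
```

Counting the missing values after the conversion shows how many entries could not be parsed in each column.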

In [7]:
# Show the data snippet before storing to file
results_df.head()

Unnamed: 0,created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,location_type,status,closed_date
0,2020-02-16 23:44:19,45631575,HEAT/HOT WATER,10455.0,511 EAST 148 STREET,EAST 148 STREET,ADDRESS,BRONX,The following complaint conditions are still o...,BRONX,40.814174,-73.91502,RESIDENTIAL BUILDING,Open,NaT
1,2020-02-16 08:58:39,45630120,HEAT/HOT WATER,11218.0,483 OCEAN PARKWAY,OCEAN PARKWAY,ADDRESS,BROOKLYN,The complaint you filed is a duplicate of a co...,BROOKLYN,40.63683,-73.972855,RESIDENTIAL BUILDING,Open,NaT
2,2020-02-16 15:33:26,45628166,HEAT/HOT WATER,11230.0,788 EAST 10 STREET,EAST 10 STREET,ADDRESS,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,40.630019,-73.967198,RESIDENTIAL BUILDING,Closed,2020-02-16 18:07:11
3,2020-02-16 20:34:31,45628140,HEAT/HOT WATER,10033.0,495 WEST 186 STREET,WEST 186 STREET,ADDRESS,NEW YORK,The following complaint conditions are still o...,MANHATTAN,40.851105,-73.928444,RESIDENTIAL BUILDING,Open,NaT
4,2020-02-16 19:58:18,45632880,HEAT/HOT WATER,10030.0,111 WEST 141 STREET,WEST 141 STREET,ADDRESS,NEW YORK,The complaint you filed is a duplicate of a co...,MANHATTAN,40.818152,-73.938661,RESIDENTIAL BUILDING,Open,NaT


In [8]:
# Store the results to a CSV file (create the target folder if it is missing)
os.makedirs('data', exist_ok = True)
results_df.to_csv(os.path.join('data', 'erm2_nwe9.csv'), index = False)
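
CSV files do not store column types, so the date columns come back as plain strings unless they are parsed again on load. The sketch below illustrates this with two rows taken from the sample output above, written to a temporary folder; the use of `tempfile` and the variable names are assumptions for the demonstration only.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the notebook's sample output
frame = pd.DataFrame({
    "created_date": pd.to_datetime(["2020-02-16 23:44:19",
                                    "2020-02-16 15:33:26"]),
    "unique_key": [45631575, 45628166],
    "closed_date": pd.to_datetime([None, "2020-02-16 18:07:11"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "erm2_nwe9.csv")
    frame.to_csv(csv_path, index=False)

    # parse_dates restores the datetime columns on load
    restored = pd.read_csv(csv_path,
                           parse_dates=["created_date", "closed_date"])

print(restored.dtypes)
```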

In [9]:
# Clean memory
results = []
results_df = []
gc.collect()

60

## PLUTO dataset for housing

This dataset for housing can be accessed from https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/xuk2-nczf. After you download the data, use only the part that is specific to the borough that you are interested in based on your analysis.

In [10]:
# Download the NYC PLUTO Dataset
!wget -O nyc_pluto_18v1.zip https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_18v1.zip

# Unpack the data
!unzip -o -j nyc_pluto_18v1.zip -d data/

# Remove the archive
!rm -f nyc_pluto_18v1.zip

--2020-02-19 09:25:39--  https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_18v1.zip
Resolving www1.nyc.gov... 23.32.9.194
Connecting to www1.nyc.gov|23.32.9.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48263311 (46M) [application/zip]
Saving to: 'nyc_pluto_18v1.zip'


2020-02-19 09:25:41 (36.2 MB/s) - 'nyc_pluto_18v1.zip' saved [48263311/48263311]

Archive:  nyc_pluto_18v1.zip
  inflating: data/BK_18v1.csv        
  inflating: data/BX_18v1.csv        
  inflating: data/MN_18v1.csv        
  inflating: data/PLUTODD18v1.pdf    
  inflating: data/PlutoReadme18v1.pdf  
  inflating: data/QN_18v1.csv        
  inflating: data/SI_18v1.csv        
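
Since only one borough is needed for the later analysis, a single file can be loaded with a restricted set of columns to keep memory use low. The following is a minimal sketch: the helper `load_borough` is hypothetical, the column names `Address` and `YearBuilt` are assumed to exist in the PLUTO files, and the demonstration runs on a tiny stand-in file rather than the real data.

```python
import os
import tempfile

import pandas as pd


def load_borough(data_dir, borough_code, columns, version="18v1"):
    """Load selected columns from one borough file, e.g. BX_18v1.csv."""
    path = os.path.join(data_dir, f"{borough_code}_{version}.csv")
    return pd.read_csv(path, usecols=columns, low_memory=False)


# Demonstration on a stand-in file with made-up content
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in = pd.DataFrame({
        "Address": ["1 EXAMPLE STREET"],
        "YearBuilt": [1931],
        "Unused": ["x"],
    })
    stand_in.to_csv(os.path.join(tmp_dir, "BX_18v1.csv"), index=False)

    pluto_bx = load_borough(tmp_dir, "BX", ["Address", "YearBuilt"])

print(pluto_bx.columns.tolist())
```

With the real download, the call would presumably look like `load_borough("data", "BX", [...])`, using column names checked against the data dictionary in `PLUTODD18v1.pdf`.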
