# What Is the Relationship between Housing Characteristics and Complaints?

The goal of this exercise is to answer Question 3 of the problem statement:

> Does the Complaint Type that you identified in response to Question 1 have an obvious relationship with any particular characteristic or characteristics of the Houses?

In this exercise, we shall use the 311 dataset in combination with the PLUTO data set. The latter shall be used for the most problematic borough, which was identified while answering Question 2.
Remember that the answer to Question 1 (What Is the Top Complaint Type?) was:

> The most often reported complaint is 'HEAT/HOT WATER'

The answer to Question 2 (What Areas Should the Agency Focus On?) was:

> The borough with the most 'HEAT/HOT WATER' complaints is 'BRONX'

Therefore, in the remainder we shall analyze whether the 'HEAT/HOT WATER' complaints reported in 'BRONX' have an obvious relationship with any particular house characteristics.

# The data sets
The 311 dataset is already well known to us, as it was used to answer Questions 1 & 2, so it does not require any special introduction.

The PLUTO data set is new to us. It aggregates condominium unit tax lot information to the billing lot.
The PLUTO data set fields initially recommended by the course advisers are:

|    Field   |                  Description                     |
|------------|--------------------------------------------------|
| Address    | An address of the tax lot |
| BldgArea   | The total gross area in square feet |
| BldgDepth  | The building’s depth, measured in feet |
| BuiltFAR   | The built floor area ratio |
| CommFAR    | The maximum allowable commercial floor area ratio |
| FacilFAR   | The maximum allowable community facility floor area ratio |
| Lot        | The one to four-digit tax lot number |
| LotArea    | Total area of the tax lot, in square feet |
| LotDepth   | The tax lot's depth measured in feet |
| NumBldgs   | The number of buildings on the tax lot |
| NumFloors  | The number of full and partial stories starting from the ground floor, for the tallest building on the tax lot |
| OfficeArea | An estimate of the exterior dimensions of the portion of the structure(s) allocated for office use |
| ResArea    | An estimate of the exterior dimensions of the portion of the structure(s) allocated for residential use |
| ResidFAR   | The maximum allowable residential floor area ratio |
| RetailArea | An estimate of the exterior dimensions of the portion of the structure(s) allocated for retail use |
| YearBuilt  | The year construction of the building was completed |
| YearAlter1 | The year of the building's most recent alteration |
| ZipCode    | A ZIP code that is valid for one of the addresses assigned to the tax lot |
| XCoord     | The X coordinate of the XY coordinate pair which depicts the approximate location of the lot |
| YCoord     | The Y coordinate of the XY coordinate pair which depicts the approximate location of the lot |

Consider reading the [PLUTO Data Dictionary](https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf?r=19v2) for more details. The data set archive consists of several CSV files, each devoted to a single borough:

|   CSV file  |      Borough       |
|-------------|--------------------|
| QN_18v1.csv | QUEENS |
| BK_18v1.csv | BROOKLYN |
| SI_18v1.csv | STATEN ISLAND |
| BX_18v1.csv | BRONX |
| MN_18v1.csv | MANHATTAN |

Since we are interested in borough *'BRONX'* we shall use the data from the corresponding *'BX_18v1.csv'* file.


# Load the data


The data can be loaded either from the IBM cloud storage or from locally present CSV files. The source is chosen based on the credentials: if their secure fields still hold the placeholder value, the local file is used, otherwise the data is read from the cloud.

In [1]:
import os
import types
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ibm_boto3

from botocore.client import Config

In [2]:
# @hidden_cell
SECURITY_DUMMY = '----------------'
erm2_nwe9_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'erm2_nwe9.csv'
}
bx_18v1_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'BX_18v1.csv'
}

In [3]:
# Returns the data source for the credentials: the IBM cloud object or a local csv file
def get_data_source(credentials) :
    '''Creates a data source from the IBM cloud or local csv file according to the credentials'''
    # Check whether real credentials are present: if not, fall back
    # to the local file, otherwise read from the cloud.
    if credentials.get('IAM_SERVICE_ID') == SECURITY_DUMMY :
        # This is the alternative to get the code run locally with a local csv file
        body = 'data' + os.path.sep + credentials.get('FILE')
    else :
        client = ibm_boto3.client(
            service_name = 's3',
            ibm_api_key_id = credentials.get('IBM_API_KEY_ID'),
            ibm_auth_endpoint = credentials.get('IBM_AUTH_ENDPOINT'),
            config = Config(signature_version='oauth'),
            endpoint_url = credentials.get('ENDPOINT'))

        body = client.get_object(
            Bucket = credentials.get('BUCKET'),
            Key = credentials.get('FILE'))['Body']

        # add missing __iter__ method, so pandas accepts body as file-like object
        def __iter__(self): return 0
        if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

    return body

Next, we shall load the 311 and PLUTO data sets one after the other. Along the way, we will select the necessary columns and check (and, if needed, correct) the column data types.

## The 311 data set
Here we first load the 311 data set:

In [4]:
# Get the data source for the credentials
dhp_ds = get_data_source(erm2_nwe9_creds)

# Read the CSV file
dhp_df = pd.read_csv(dhp_ds, parse_dates = ['created_date', 'closed_date'])

Next, we select the data related to the 'HEAT/HOT WATER' complaints reported in 'BRONX'.

In [5]:
print('Number of all complaints:', dhp_df.shape[0])
dhp_df = dhp_df[(dhp_df['complaint_type'] == 'HEAT/HOT WATER') & (dhp_df['borough'] == 'BRONX')]
print('Number of \'HEAT/HOT WATER\' complaints in \'BRONX\':', dhp_df.shape[0])

Number of all complaints: 6034470
Number of 'HEAT/HOT WATER' complaints in 'BRONX': 609783


Next, let us note that the 311 data set will only be used to select the lots from the PLUTO data set which had 'HEAT/HOT WATER' complaints in 'BRONX'. Therefore, we shall only keep the relevant columns here, i.e. the property address and zip code. We shall also rename the columns to match those of PLUTO.

In [6]:
dhp_df = dhp_df[['incident_address', 'incident_zip']]
dhp_df = dhp_df.rename({'incident_address':'Address', 'incident_zip':'ZipCode'}, axis=1)
dhp_df.head()

|    | Address | ZipCode |
|----|---------|---------|
| 0  | 511 EAST 148 STREET | 10455.0 |
| 7  | 1275 EDWARD L GRANT HIGHWAY | 10452.0 |
| 12 | 152 EAST 171 STREET | 10452.0 |
| 16 | 2523 UNIVERSITY AVENUE | 10468.0 |
| 26 | 3226 BRONXWOOD AVENUE | 10469.0 |
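
Renaming the columns to `Address` and `ZipCode` makes it possible to match complaints against PLUTO lots on those two keys. The following is only a minimal sketch of how such a selection could look, assuming an inner join on both columns. The toy data frames, their values, and the `NumComplaints` column name are hypothetical, and the actual matching may be done differently.

```python
import pandas as pd

# Toy stand-ins for the 311 and PLUTO data frames (illustrative values only)
complaints = pd.DataFrame({
    'Address': ['1 MAIN STREET', '1 MAIN STREET', '2 OAK AVENUE'],
    'ZipCode': [10451.0, 10451.0, 10452.0],
})
lots = pd.DataFrame({
    'Address': ['1 MAIN STREET', '2 OAK AVENUE', '3 ELM ROAD'],
    'ZipCode': [10451.0, 10452.0, 10453.0],
    'NumFloors': [5.0, 2.0, 3.0],
})

# Count the complaints per (Address, ZipCode) pair
complaint_counts = (complaints
                    .groupby(['Address', 'ZipCode'])
                    .size()
                    .reset_index(name='NumComplaints'))

# Keep only the lots that had at least one complaint
lots_with_complaints = lots.merge(complaint_counts,
                                  on=['Address', 'ZipCode'],
                                  how='inner')
print(lots_with_complaints)
```

An exact match on address strings is sensitive to spelling and spacing differences between the two sources, so some normalization may be needed in practice.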


Finally, let us check the column types:

In [7]:
dhp_df.dtypes

Address     object
ZipCode    float64
dtype: object

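The types are in order: the address is a string (`object`) and the zip code is a float. The zip codes are stored as floats rather than integers because the column contains missing values, which pandas represents as NaN.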
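
A small sketch of why a zip code column ends up as `float64`: a NumPy-backed integer column cannot hold NaN, so pandas falls back to floats when values are missing. The series below is made up for illustration, and the conversion to the nullable `Int64` type is an optional alternative, not something the notebook does.

```python
import numpy as np
import pandas as pd

# A zip code column with one missing value is stored as float64
zips = pd.Series([10455, 10452, np.nan])
print(zips.dtype)

# The nullable integer type keeps integer values and marks gaps as <NA>
zips_int = zips.astype('Int64')
print(zips_int.dtype)
```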

## The PLUTO data set
Here we first load the PLUTO data set:

In [8]:
# Get the data source for the credentials
bx_ds = get_data_source(bx_18v1_creds)

# Read the CSV file
bx_df = pd.read_csv(bx_ds, low_memory = False)

Next, we select the recommended fields:

In [9]:
bx_df = bx_df[['Address', 'BldgArea', 'BldgDepth', 'BuiltFAR',
              'CommFAR', 'FacilFAR', 'Lot', 'LotArea', 'LotDepth',
              'NumBldgs', 'NumFloors', 'OfficeArea', 'ResArea',
              'ResidFAR', 'RetailArea', 'YearBuilt', 'YearAlter1',
              'ZipCode', 'YCoord', 'XCoord']]

Next, we list the column types to check that they are all in order:

In [10]:
bx_df.dtypes

Address        object
BldgArea        int64
BldgDepth     float64
BuiltFAR      float64
CommFAR       float64
FacilFAR      float64
Lot             int64
LotArea         int64
LotDepth      float64
NumBldgs        int64
NumFloors     float64
OfficeArea      int64
ResArea         int64
ResidFAR      float64
RetailArea      int64
YearBuilt       int64
YearAlter1      int64
ZipCode       float64
YCoord        float64
XCoord        float64
dtype: object

All data frame columns have proper numeric types, except for `Address`, which is a string.
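
The same check can be done programmatically instead of by reading the list. This is a sketch on a toy data frame with hypothetical values, using `is_numeric_dtype` from `pandas.api.types`:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one string column and two numeric ones (illustrative values)
toy = pd.DataFrame({
    'Address': ['1 MAIN STREET'],
    'BldgArea': [1200],
    'NumFloors': [2.0],
})

# Collect the columns whose dtype is not numeric
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)
```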

# Data Exploration and Cleaning

## The 311 data set

We can now describe the 311 data set to get some insight into the remaining data:

In [11]:
dhp_df.describe(include='all')

|        | Address | ZipCode |
|--------|---------|---------|
| count  | 609782 | 603797.0 |
| unique | 22902 | |
| top    | 3810 BAILEY AVENUE | |
| freq   | 7115 | |
| mean   | | 10460.695938 |
| std    | | 6.493728 |
| min    | | 10451.0 |
| 25%    | | 10456.0 |
| 50%    | | 10460.0 |
| 75%    | | 10467.0 |


There is not much to see here, as most of the statistics are not informative. For the addresses, we can only tell that `609782` of them are non-missing, of which `22902` are unique, and that the address with the most complaints is `3810 BAILEY AVENUE` with `7115` complaints over all the years. For the zip codes there is even less useful information: we can only use the number of non-missing values, `603797`. Given the `609783` rows in the data frame, this indicates `609783 - 603797 = 5986` missing zip codes and `609783 - 609782 = 1` missing address. Therefore, let us now explicitly check for the Na/NaN values:

In [12]:
missing = dhp_df.isna()
print('The number of missing addresses is:', missing.Address.sum(),
      '\nThe number of missing zip codes is:', missing.ZipCode.sum())

The number of missing addresses is: 1 
The number of missing zip codes is: 5986
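
The link between the `count` row of `describe()` and the number of missing values can be sketched on toy data. The frame below is hypothetical; it only illustrates that the per-column missing count equals the number of rows minus the non-missing count, and that a single row may be missing more than one field.

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks both fields, one row lacks only the zip code
toy = pd.DataFrame({
    'Address': ['1 MAIN STREET', None, '2 OAK AVENUE', '3 ELM ROAD'],
    'ZipCode': [10451.0, np.nan, np.nan, 10453.0],
})

# Missing values per column, counted directly
missing_per_column = toy.isna().sum()

# The same numbers derived from the non-missing counts
derived = len(toy) - toy.count()

# Number of rows with at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```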


As one can see, there are missing values which we cannot easily restore, so let us drop the corresponding rows:

In [13]:
print('The number of complaints including Na/NaN values:', dhp_df.shape[0])
dhp_df.dropna(inplace = True)
print('The number of clean complaints:', dhp_df.shape[0])

The number of complaints including Na/NaN values: 609783
The number of clean complaints: 603797


As one can see, the amount of dropped data is marginal: just about `100 - (603797 * 100 / 609783) = 0.98`%.
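
The share of dropped rows can be recomputed from the two row counts printed above. The variable names are illustrative only.

```python
# Row counts before and after dropping the Na/NaN values
rows_before = 609783
rows_after = 603797

# Absolute and relative amount of dropped rows
dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before

print(dropped, round(dropped_pct, 2))
```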

## The PLUTO data set
We can now describe the PLUTO data set to get some insight into the selected data:

In [14]:
bx_df.describe(include='all')

Unnamed: 0,Address,BldgArea,BldgDepth,BuiltFAR,CommFAR,FacilFAR,Lot,LotArea,LotDepth,NumBldgs,NumFloors,OfficeArea,ResArea,ResidFAR,RetailArea,YearBuilt,YearAlter1,ZipCode,YCoord,XCoord
count,89785,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89854.0,89525.0,86595.0,86595.0
unique,87017,,,,,,,,,,,,,,,,,,,
top,SHORE DRIVE,,,,,,,,,,,,,,,,,,,
freq,42,,,,,,,,,,,,,,,,,,,
mean,,8113.609,48.229342,1.107134,0.130644,2.853723,111.493601,10239.04,105.978085,1.184778,2.273265,505.7144,5720.876,1.674844,349.91691,1805.69515,176.591782,10464.280726,249975.676667,1021686.0
std,,65204.39,31.333564,1.799155,0.574606,1.605805,467.387099,305825.2,73.946506,1.929445,1.492908,11966.41,56601.9,1.309456,4911.023897,499.485278,567.142346,7.292127,9778.61412,8599.34
min,,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10451.0,227527.0,1002677.0
25%,,1598.0,35.0,0.55,0.0,2.0,20.0,2188.0,95.0,1.0,2.0,0.0,1152.0,0.9,0.0,1920.0,0.0,10460.0,241918.0,1014310.0
50%,,2226.0,44.67,0.86,0.0,2.0,41.0,2508.0,100.0,1.0,2.0,0.0,1760.0,1.1,0.0,1931.0,0.0,10465.0,248586.0,1023321.0
75%,,3288.0,55.0,1.25,0.0,4.8,73.0,4250.0,102.42,1.0,3.0,0.0,2616.0,2.43,0.0,1960.0,0.0,10469.0,258036.5,1027126.0
