# Predict Complaint Types

The goal of this exercise is to develop and validate a model that answers Question 4 of the problem statement:

> Can a predictive model be built for future prediction of the possibility of complaints of the specific type that you identified in response to Question 1?

This exercise builds on the findings of the previous three exercises. We shall use the 311 complaints and PLUTO data sets to feature-engineer a data set of 'HEAT/HOT WATER' complaints per tax lot. That data set is then used to build a predictive model that estimates the number of future complaints from selected house characteristics (which we will also refer to as properties or features).

We shall formalize the question at hand as follows:

> Build a prediction model for the number of 'HEAT/HOT WATER' complaints per year for a house with a selected set of characteristics.

The rest of the work is organized as follows. We shall:
1. Load, clean and prepare the data sets
   * In the same way as for Questions 1 to 3
2. Join the '311 complaint' data set with the PLUTO one
   * In the same way as for Questions 1 to 3
3. Determine the models to be used
   * This will influence the feature selection process
4. Perform house feature selection
   * This is redone because a different model was used for Question 3
5. Perform model training
   * Including parameter tuning and cross validation, if any
6. Evaluate and compare models
7. Recommend the best performing model

Please note that the data sets are not described here, as this was already done when answering Questions 1 to 3. We only repeat that the PLUTO data set is initially taken with the following set of features:

In [8]:
pluto_features = ['Address', 'BldgArea', 'BldgDepth', 'BuiltFAR',
              'CommFAR', 'FacilFAR', 'Lot', 'LotArea', 'LotDepth',
              'NumBldgs', 'NumFloors', 'OfficeArea', 'ResArea',
              'ResidFAR', 'RetailArea', 'YearBuilt', 'YearAlter1',
              'ZipCode', 'YCoord', 'XCoord']
print('The initial set of PLUTO features to consider:\n', pluto_features)

The initial set of PLUTO features to consider:
 ['Address', 'BldgArea', 'BldgDepth', 'BuiltFAR', 'CommFAR', 'FacilFAR', 'Lot', 'LotArea', 'LotDepth', 'NumBldgs', 'NumFloors', 'OfficeArea', 'ResArea', 'ResidFAR', 'RetailArea', 'YearBuilt', 'YearAlter1', 'ZipCode', 'YCoord', 'XCoord']


# 1. Load, clean, prepare

The data can be loaded either from the IBM cloud storage or from locally present CSV files. The choice depends on whether the credentials contain real values in their secure fields:

In [12]:
import os
import re
import types  # needed by get_data_source() to patch __iter__ onto the cloud body
import ibm_boto3
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from botocore.client import Config
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

%matplotlib inline

In [13]:
# @hidden_cell
SECURITY_DUMMY = '----------------'
erm2_nwe9_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'erm2_nwe9.csv'
}
bk_18v1_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'BK_18v1.csv'
}
bx_18v1_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'BX_18v1.csv'
}
mn_18v1_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'MN_18v1.csv'
}
qn_18v1_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'QN_18v1.csv'
}
si_18v1_creds = {
    'IAM_SERVICE_ID'    : SECURITY_DUMMY,
    'IBM_API_KEY_ID'    : SECURITY_DUMMY,
    'ENDPOINT'          : 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT' : 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET'            : SECURITY_DUMMY,
    'FILE'              : 'SI_18v1.csv'
}

In [14]:
# Returns the data source for the credentials: the IBM cloud object or a local csv file path
def get_data_source(credentials) :
    '''Creates a data source from the IBM cloud or local csv file according to the credentials'''
    # If the credentials still hold the placeholder value, fall back
    # to the local file; otherwise read from the cloud.
    if credentials.get('IAM_SERVICE_ID') == SECURITY_DUMMY :
        # This alternative allows running the code locally with a local csv file
        body = os.path.join('data', credentials.get('FILE'))
    else :
        client = ibm_boto3.client(
            service_name = 's3',
            ibm_api_key_id = credentials.get('IBM_API_KEY_ID'),
            ibm_auth_endpoint = credentials.get('IBM_AUTH_ENDPOINT'),
            config = Config(signature_version='oauth'),
            endpoint_url = credentials.get('ENDPOINT'))

        body = client.get_object(
            Bucket = credentials.get('BUCKET'),
            Key = credentials.get('FILE'))['Body']

        # add missing __iter__ method, so pandas accepts body as file-like object
        def __iter__(self): return 0
        if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

    return body

Next, we load the 311 and PLUTO data sets in turn. Along the way, we select the necessary columns and check (and, if needed, correct) the column data types.

## 311 complaints

Below we load the data set and then select the required complaints along with the needed columns:

In [104]:
# Get the data source for the credentials
dhp_ds = get_data_source(erm2_nwe9_creds)

# Read the CSV file
dhp_df = pd.read_csv(dhp_ds, parse_dates = ['created_date', 'closed_date'])
print('Number of all complaints:', dhp_df.shape[0])

Number of all complaints: 6034470
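
Since only a few columns of the roughly six million rows are used later on, the read could be restricted to them. Below is a minimal sketch, assuming the column names that appear later in this notebook; the helper name `read_complaints` and the in-memory toy CSV are hypothetical stand-ins for the real file:

```python
import io
import pandas as pd

# Columns used later in this notebook
NEEDED_COLS = ['created_date', 'complaint_type', 'incident_address', 'incident_zip']

def read_complaints(source):
    '''Reads only the needed columns; source is a path or a file-like object'''
    return pd.read_csv(source, usecols=NEEDED_COLS, parse_dates=['created_date'])

# Hypothetical toy data standing in for erm2_nwe9.csv
toy_csv = io.StringIO(
    'unique_key,created_date,closed_date,complaint_type,incident_address,incident_zip\n'
    '1,2015-01-02 10:00:00,2015-01-03 10:00:00,HEAT/HOT WATER,10 WEST 28 STREET,10001\n'
    '2,2016-05-06 11:30:00,2016-05-07 09:00:00,PLUMBING,100 WEST 26 STREET,10001\n'
)
toy_df = read_complaints(toy_csv)
print(toy_df.columns.tolist())
```

Whether this pays off depends on how many columns the full file has; the rest of the notebook works the same either way, because the unused columns are dropped shortly afterwards.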


In [105]:
# Select the 'HEAT/HOT WATER' complaints
dhp_df = dhp_df[(dhp_df['complaint_type'] == 'HEAT/HOT WATER')]
print('Number of \'HEAT/HOT WATER\' complaints is:', dhp_df.shape[0])

Number of 'HEAT/HOT WATER' complaints is: 2159103


In [106]:
# Select the columns that matter, and rename for convenience
dhp_df = dhp_df[['created_date', 'incident_address', 'incident_zip']]
dhp_df = dhp_df.rename({'incident_address':'Address', 'incident_zip':'ZipCode'}, axis=1)

In [113]:
# Convert the address to upper case for uniformity; the .str accessor
# keeps missing values as NaN so that they can still be dropped below
dhp_df.Address = dhp_df.Address.str.upper()
# Strip the address strings
dhp_df.Address = dhp_df.Address.str.strip()
# Replace each sequence of white spaces with a single space
dhp_df.Address = dhp_df.Address.str.replace(r'\s+', ' ', regex=True)
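
The effect of this normalization can be checked on a few made-up strings. The sketch below uses a hypothetical toy series and the vectorized `.str` accessor, which leaves missing values as missing instead of turning them into text:

```python
import pandas as pd

# Hypothetical raw addresses with mixed case, padding and repeated spaces
raw = pd.Series(['  10 west   28 Street ', '100 WEST 26 STREET', None])

clean = (raw.str.upper()
            .str.strip()
            .str.replace(r'\s+', ' ', regex=True))
print(clean.tolist())
```

The first entry becomes `10 WEST 28 STREET`, the second one is unchanged, and the missing entry stays missing.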

In [108]:
# Drop the Na/NaN valued rows
init_size = dhp_df.shape[0]
dhp_df.dropna(inplace = True)
print('Number of rows before dropping Na/NaN:', init_size,', after:', dhp_df.shape[0])

Number of rows before dropping Na/NaN: 2159103 , after: 2140078


Extract the year the complaint was created and then drop the *'created_date'* column.

In [109]:
dhp_df['Year'] = dhp_df.created_date.dt.year
dhp_df.drop(columns = ['created_date'], inplace = True)

Let us summarize the Year statistics so far:

In [122]:
year_descr = dhp_df['Year'].describe().astype(int)
year_descr

count    2140078
mean        2014
std            2
min         2010
25%         2012
50%         2015
75%         2017
max         2020
Name: Year, dtype: int64

As one can see, the years range from `2010` to `2020`, so there are no out-of-range *'created_date'* values; rows with missing values were already dropped above.
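
Note that `describe()` skips missing entries, so the year range by itself does not reveal them. An explicit check could look like the following sketch on made-up timestamps; the variable names are illustrative only:

```python
import pandas as pd

# Hypothetical timestamps, including one missing value (NaT)
created = pd.Series(pd.to_datetime(['2010-01-15', '2020-03-01', None, '2015-07-04']))

# Count the missing timestamps explicitly
n_missing = int(created.isna().sum())

# Check that all the remaining years fall into the expected range
years = created.dropna().dt.year
in_range = bool(years.between(2010, 2020).all())
print('missing:', n_missing, '| all years within 2010-2020:', in_range)
```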

We only need the average number of complaints per year for each *'Address'*/*'ZipCode'* pair:

In [127]:
# Group by the zip code and address to count the complaints
dhp_df = dhp_df.groupby(['ZipCode', 'Address']).size().to_frame()

# Rename the counts column and then compute the average count for the min/max years range
dhp_df.rename({0 : 'AvgCnt'}, axis = 1, inplace = True)
dhp_df.AvgCnt = dhp_df.AvgCnt/(year_descr.loc['max'] - year_descr.loc['min'] + 1)

# Re-set the indexes to turn the Address and ZipCode back into columns
dhp_df.reset_index(level = 1, inplace = True)
dhp_df.reset_index(level = 0, inplace = True)
dhp_df.head()

| | ZipCode | Address | AvgCnt |
|---|---|---|---|
| 0 | 10001.0 | 10 WEST 28 STREET | 0.090909 |
| 1 | 10001.0 | 100 WEST 26 STREET | 0.090909 |
| 2 | 10001.0 | 102 WEST 29 STREET | 0.090909 |
| 3 | 10001.0 | 103 WEST 27 STREET | 0.090909 |
| 4 | 10001.0 | 11 WEST 34 STREET | 0.090909 |
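
The value `0.090909` is simply one complaint divided by the 11 years from 2010 through 2020, i.e. 1/11. The same computation can be followed on a small made-up data set; the sketch below uses hypothetical rows and a three-year span:

```python
import pandas as pd

# Hypothetical complaints: one row per complaint
toy = pd.DataFrame({
    'ZipCode': [10001.0, 10001.0, 10001.0, 10002.0],
    'Address': ['10 WEST 28 STREET', '10 WEST 28 STREET',
                '100 WEST 26 STREET', '1 MAIN STREET'],
    'Year':    [2010, 2012, 2011, 2012],
})

# Number of years covered by the data, both ends included
num_years = toy['Year'].max() - toy['Year'].min() + 1

# Count complaints per (ZipCode, Address) and divide by the year span
avg = (toy.groupby(['ZipCode', 'Address'])
          .size()
          .div(num_years)
          .rename('AvgCnt')
          .reset_index())
print(avg)
```

Here the address with two complaints over three years gets an average of 2/3, and the other two addresses get 1/3 each.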


Finally, we check on the column types:

In [126]:
dhp_df.dtypes

ZipCode    float64
Address     object
AvgCnt     float64
dtype: object

All the data types are in order: the address is a string and the zip code and average complaint count are floats.

## PLUTO 

# 2. Join data sets

# 3. Model selection

# 4. Feature selection

## 4.1 Univariate Selection

## 4.2 Feature Importance

## 4.3 Feature correlations

## 4.4 Final selection

# 5. Split the data

# 6. Train models

# 7. Evaluate models

## 7.1 Model 1

## 7.2 Model 2

## 7.3 Evaluation summary

# 8. Conclusions