This notebook performs the following tasks:

3.1 [Explore binary features](#3.1)


3.2 [Build classifier](#3.2)

Kush did tasks 3.1-3.2. 

In [7]:
# Data tools
import pandas as pd
import matplotlib.pyplot as plt

# ML tools
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

In [59]:
train = pd.read_csv('data_train_clean1.csv')
test = pd.read_csv('data_test_clean1.csv')

# 3.1 Explore binary features
<a id='3.1'></a>

We hope to find a binary feature that closely tracks whether or not a permit is issued. We explore the candidates below.

First, we compute the baseline accuracies of the constant model, which always predicts the majority class:

In [88]:
baseline_tr = max(sum(train['Issued or not'] == 1), sum(train['Issued or not'] == 0)) / train.shape[0]
baseline_te = max(sum(test['Issued or not'] == 1), sum(test['Issued or not'] == 0)) / test.shape[0]

print(f"Baseline training set accuracy = {baseline_tr}")
print(f"Baseline test set accuracy     = {baseline_te}")

Baseline training set accuracy = 0.9252513826043238
Baseline test set accuracy     = 0.9234288587229764


In [89]:
confusion_matrix(train['Site Permit'], train['Issued or not'])

array([[ 10178, 144636],
       [  1716,   2590]], dtype=int64)

In [90]:
confusion_matrix(train['Structural Notification'], train['Issued or not'])

array([[ 10522, 143049],
       [  1372,   4177]], dtype=int64)

In [91]:
confusion_matrix(train['Fire Only Permit'], train['Issued or not'])

array([[ 11448, 132581],
       [   446,  14645]], dtype=int64)

The binary features above are hardly indicative of issuance. In every row of each matrix the majority of permits are issued, so a rule based on any one of these features makes the same predictions as the constant model.
We believe this is because permits go unissued not because of the underlying building modification, but because of something more technical in the permit process.
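
To make the comparison with the constant model concrete, here is a minimal sketch that works directly from the three confusion matrices printed above. The helper names `constant_model_accuracy` and `best_single_feature_accuracy` are ours, introduced for illustration. Rows are assumed to be feature values and columns the `Issued or not` label, matching the argument order passed to `confusion_matrix`.

```python
import numpy as np

def constant_model_accuracy(cm):
    """Accuracy of always predicting the most common label (largest column sum)."""
    cm = np.asarray(cm)
    return cm.sum(axis=0).max() / cm.sum()

def best_single_feature_accuracy(cm):
    """Accuracy of the best rule that assigns one label to each feature value (row)."""
    cm = np.asarray(cm)
    return cm.max(axis=1).sum() / cm.sum()

# Confusion matrices copied from the outputs above.
matrices = {
    'Site Permit':             [[10178, 144636], [1716, 2590]],
    'Structural Notification': [[10522, 143049], [1372, 4177]],
    'Fire Only Permit':        [[11448, 132581], [446, 14645]],
}

for name, cm in matrices.items():
    const_acc = constant_model_accuracy(cm)
    best_acc = best_single_feature_accuracy(cm)
    print(f"{name:<24} constant = {const_acc:.4f}, best single-feature rule = {best_acc:.4f}")
```

For all three features the two numbers coincide, because "issued" is the majority label in both rows of every matrix.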

Our hypothesis is that permits are rejected because they don't provide enough technical information. To test this, we flag missing values in the cost, construction type, unit, and use fields.

In [92]:
missing_estimated_cost = np.array(train['Estimated Cost'].isna())
missing_revised_cost   = np.array(train['Revised Cost'].isna())
missing_existing_type  = np.array(train['Existing Construction Type'].isna())
missing_proposed_type  = np.array(train['Proposed Construction Type'].isna())
missing_existing_units = np.array(train['Existing Units'].isna())
missing_proposed_units = np.array(train['Proposed Units'].isna())
missing_existing_use   = np.array(train['Existing Use'].isna())
missing_proposed_use   = np.array(train['Proposed Use'].isna())
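
Only one of these indicators is tabulated in this notebook. As a rough sketch of how all of them could be checked in one pass, the loop below cross-tabulates each missing-value flag against the label with `pd.crosstab`. The `toy` DataFrame and its values are hypothetical stand-ins for `train`, used only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` (hypothetical values, not the permit data).
toy = pd.DataFrame({
    'Estimated Cost': [1000.0, np.nan, 2500.0, 400.0, np.nan, 800.0],
    'Revised Cost':   [1200.0, np.nan, 2500.0, np.nan, 300.0, 900.0],
    'Issued or not':  [1, 0, 1, 0, 1, 1],
})

columns_to_check = ['Estimated Cost', 'Revised Cost']

tables = {}
for col in columns_to_check:
    # 1 where the field is missing, 0 where it is present.
    missing = toy[col].isna().astype(int).rename(f"{col} missing")
    # Rows: missing flag. Columns: label. Same layout as the confusion matrices.
    tables[col] = pd.crosstab(missing, toy['Issued or not'])
    print(tables[col], end="\n\n")
```

With the real data, `columns_to_check` would list the eight fields flagged above and `toy` would be replaced by `train`.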

In [94]:
confusion_matrix(missing_revised_cost, train['Issued or not'])

array([[  7098, 147220],
       [  4796,      6]], dtype=int64)
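
The matrix above can be read as conditional issuance rates. The short sketch below recomputes them from the printed counts, along with the accuracy of the rule "predict not issued exactly when Revised Cost is missing". The variable names are ours, and the row/column interpretation assumes the argument order used in the `confusion_matrix` call.

```python
import numpy as np

# Rows: Revised Cost present (0) / missing (1).
# Columns: permit not issued (0) / issued (1).
cm = np.array([[7098, 147220],
               [4796,      6]])

# Fraction of permits issued within each row.
issued_rate = cm[:, 1] / cm.sum(axis=1)

# Accuracy of predicting "issued" when the cost is present and
# "not issued" when it is missing.
rule_accuracy = (cm[0, 1] + cm[1, 0]) / cm.sum()

print(f"P(issued | Revised Cost present) = {issued_rate[0]:.4f}")
print(f"P(issued | Revised Cost missing) = {issued_rate[1]:.4f}")
print(f"Accuracy of the missing-cost rule = {rule_accuracy:.4f}")
```

Almost no permit with a missing revised cost is issued, which is what separates this indicator from the binary features in the previous cells.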

# 3.2 Build classifier
<a id='3.2'></a>

Not every feature space we tried is shown. The omitted ones include permit type, zipcode, and various numerical variables. The regularization parameter $C$ was hand-tuned on a validation set, but tuning yielded no improvement.
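
The tuning runs themselves are not included. As an illustration only, a validation sweep over $C$ might look like the sketch below. The synthetic data, the 75/25 split, and the grid of $C$ values are all assumptions made so the example is self-contained; they are not the settings used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in (an assumption, not the permit data): one binary
# feature, and a label that is the opposite of the feature 90% of the time.
n = 400
feature = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
y = np.where(flip, feature, 1 - feature)
X = feature.reshape(-1, 1).astype(float)

# Hold out a validation set from the training data.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit one SVC per candidate C and record its validation accuracy.
val_scores = {}
for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    clf = SVC(C=C, gamma='auto')
    clf.fit(X_fit, y_fit)
    val_scores[C] = clf.score(X_val, y_val)

for C, acc in val_scores.items():
    print(f"C = {C:<6} validation accuracy = {acc:.4f}")
```

With a single binary input, a classifier can make at most two distinct predictions, which may explain why changing $C$ made no difference here.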

In [97]:
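def features_labels(data):
    # Single feature: True where 'Revised Cost' is missing
    missing_revised_cost = np.array(data['Revised Cost'].isna())
    # scikit-learn expects a 2-D array of shape (n_samples, n_features)
    X = missing_revised_cost.reshape(-1, 1)
    # Label: 1 if the permit was issued, 0 otherwise
    y = np.array(data['Issued or not'])
    return X, y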

In [98]:
X_tr, y_tr = features_labels(train)
X_te, y_te = features_labels(test)

In [99]:
clf = SVC(C=1.0, gamma='auto')
clf.fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr) 
test_acc  = clf.score(X_te, y_te)
print(f"Training accuracy = {train_acc}")
print(f"Test accuracy     = {test_acc}")

Training accuracy = 0.9553544494720966
Test accuracy     = 0.955052790346908


The data dictionary doesn't clarify exactly when "Revised Cost" is entered during the construction process. It isn't simply filled in once a permit is issued, since in that case this feature would predict issuance with 100% accuracy. Based on this extremely simple model, we suggest that permit applications be updated with a revised cost during the application process to increase their chance of being issued.