# Predicting blight ticket compliance in Detroit, MI

The data sets and the problem that this code attempts to solve were provided as part of a Coursera course, Applied Machine Learning in Python. The assignment was relatively free of constraints: train a model that achieves an AUC of 0.7 or better on a previously unseen test set, without using MLPs. I decided to put the code on my personal GitHub because this is only one of a great many possible answers and the problem itself is interesting.

To summarize the problem, blight refers to dilapidated housing or structures that may be unsightly at best and unsafe or abandoned at worst. Blight tickets are intended to hold property owners accountable for maintaining an acceptable level of upkeep by levying a fine on those who allow their property to fall into disrepair. These fines are not always paid, however, which forms the basis of the primary question for this bite-sized project: can a model be trained to accurately predict which individuals will comply by paying the fine? More information about blight tickets can be found [here](https://detroitmi.gov/departments/department-appeals-and-hearings/blight-ticket-information).

The data set used here was pulled directly from Coursera, which in turn obtained the data from the [Detroit Open Data Portal](https://data.detroitmi.gov/). It contains information about all tickets issued from 2004 to 2011. Two additional files, 'addresses.csv' and 'latlons.csv', allow for an easy conversion from text-based addresses to latitudes and longitudes.

<br>

__The file 'train.csv' contains the following fields__ (from Coursera):

_Information about the ticket, those who issued it_:
>ticket_id - unique identifier for tickets<br>
agency_name - Agency that issued the ticket<br>
inspector_name - Name of inspector that issued the ticket<br>

_Information about the violator_:
>violator_name - Name of the person/organization that the ticket was issued to<br>
violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred<br>
mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator<br>

_Relevant dates_:
>ticket_issued_date - Date and time the ticket was issued<br>
hearing_date - Date and time the violator's hearing was scheduled<br>

_Details about the violation and judgment_:
>violation_code, violation_description - Type of violation<br>
disposition - Judgment and judgment type<br>

_Financial details_:
>fine_amount - Violation fine amount, excluding fees<br>
admin_fee - 20 USD fee assigned to responsible judgments<br>
state_fee - 10 USD fee assigned to responsible judgments<br>
late_fee - 10% fee assigned to responsible judgments<br>
discount_amount - discount applied, if any<br>
clean_up_cost - DPW clean-up or graffiti removal cost<br>
judgment_amount - Sum of all fines and fees<br>

_Misc_:
>grafitti_status - Flag for graffiti violations<br>

_Payment and compliance information_:
>payment_amount - Amount paid, if any<br>
payment_date - Date payment was made, if it was received<br>
payment_status - Current payment status as of Feb 1 2017<br>
balance_due - Fines and fees still owed<br>
collection_status - Flag for payments in collections<br>
compliance<br>
 Null = Not responsible<br>
 0 = Responsible, non-compliant<br>
 1 = Responsible, compliant<br>
compliance_detail - More information on why each ticket was marked compliant or non-compliant<br>

<br>
I have elected to use a random forest classifier for this problem as it allows me to easily handle a variety of input types, does not require scaling of the input data, and can easily provide a list of the features that the model considers to be the most important, an attribute that I find useful during feature selection and model refinement. Data cleaning and feature selection decisions will be explained in markdown preceding the relevant cells.

In conclusion, the most important features appear to be the latitude and longitude of the property. This may suggest that blight tickets are often levied on owners of structures in poorer regions of the city, who are therefore less able to pay; cross-referencing with another data set that describes the geographical distribution of wealth in the city may provide more insight into this point. Another interesting finding is that the next two most important features are the judgment amount (the total fine levied against the property owner) and the discount amount. The judgment amount may indicate that individuals are unable or simply unwilling to pay larger fines, whereas the discount amount may play a more psychological role in the outcome: if individuals feel that the fine has been reduced, they may be less upset about having received a ticket, and may therefore be more likely to pay it. If the city of Detroit wants to increase compliance, it may be worth decreasing the fines levied for each offense or increasing the discount amount. If smaller fines raise compliance enough, total revenue could increase even though each individual blight ticket brings in less.

In [84]:
import pandas as pd
import numpy as np
import datetime

In [67]:
# Read in all of the relevant csv files
train = pd.read_csv('data/train.csv', encoding = "ISO-8859-1", low_memory=False)
addresses = pd.read_csv('data/addresses.csv')
latlons = pd.read_csv('data/latlons.csv')

In [68]:
# Drop NULL compliance values to leave binary classification problem
train.dropna(subset=['compliance'], inplace=True)

# Drop a number of features, for the reasons listed:
# - collection_status, compliance_detail, and the payment columns (payment_amount, payment_date,
#   payment_status, balance_due): removed to avoid data leakage.
# - late_fee: also removed to avoid data leakage.
# - violator_name: doesn't seem likely to provide much generalizable information.
# - violation_street_number, violation_street_name, violation_zip_code: replaced with latitude/longitude.
# - fine_amount, admin_fee, state_fee: already rolled into judgment_amount. discount_amount and
#   clean_up_cost are kept for now; clean_up_cost may be removed later.
# - mailing_address_str_number, mailing_address_str_name, zip_code, non_us_str_code: mailing address details.
# - violation_description: should overlap with violation_code.
# - city: one-hot encoding it kept killing the kernel.
# - grafitti_status: the entire column was NaN.
# - ticket_issued_date, hearing_date: an earlier attempt to derive the time to hearing from these columns
#   gave odd data. For example, some values were less than zero, which should be impossible.
droplist = ['violator_name', 'violation_street_number', 'violation_street_name', 
            'violation_zip_code', 'fine_amount','admin_fee','state_fee',
            'payment_amount', 'payment_date', 'payment_status', 'balance_due',
            'collection_status', 'compliance_detail', 
            'mailing_address_str_name', 'zip_code', 'non_us_str_code',
            'violation_description', 'city', 'grafitti_status', 'mailing_address_str_number',
            'ticket_issued_date', 'hearing_date', 'late_fee']
train.drop(droplist, axis=1, inplace=True)
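
The comment in the cell above notes that a time-to-hearing feature produced negative values. A minimal sketch of how those rows could be inspected, rather than discarding the date columns outright, is given below. It runs on a made-up three-row frame, not on train.csv, and the column names `days_to_hearing` and `hearing_before_issue` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the two date columns in train.csv (values are made up).
dates = pd.DataFrame({
    'ticket_issued_date': ['2005-03-01 10:00:00', '2005-06-15 09:30:00', '2006-01-10 14:00:00'],
    'hearing_date': ['2005-03-21 09:00:00', '2005-06-01 09:00:00', None],
})

# Unparseable or missing dates become NaT instead of raising.
issued = pd.to_datetime(dates['ticket_issued_date'], errors='coerce')
hearing = pd.to_datetime(dates['hearing_date'], errors='coerce')

# Days from issue to hearing; rows with a missing hearing date end up as NaN.
dates['days_to_hearing'] = (hearing - issued).dt.total_seconds() / 86400

# Flag (rather than silently drop) rows where the hearing precedes the ticket.
dates['hearing_before_issue'] = dates['days_to_hearing'] < 0
print(dates[['days_to_hearing', 'hearing_before_issue']])
```

Flagging the inconsistent rows keeps them available for inspection. Whether they reflect data-entry errors or rescheduled hearings is not something the data set documentation settles, so that remains an open question.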

In [69]:
# Merge address and lat lon, then merge into main df
addFull = pd.merge(addresses, latlons, on='address')

# Remove address as the info there is already covered by lat and lon
X = pd.merge(train, addFull, on='ticket_id').drop('address', axis=1)

# One-hot encode categorical variables
categoricalCols = ['agency_name', 'state', 'country', 'disposition', 
                   'violation_code', 'inspector_name']

for col in categoricalCols:
    X = pd.concat([X.drop(col, axis=1), pd.get_dummies(X[col])], axis=1)

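# Disabled: new feature - time from ticket issue to hearing date, remove old cols, drop odd vals.
# The date columns are dropped in the cell above, and the last line would also need `from scipy import stats`.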
#X['time_to_hearing'] = (pd.to_datetime(X['hearing_date']) - pd.to_datetime(X['ticket_issued_date'])).dt.total_seconds()
#X.drop(['ticket_issued_date', 'hearing_date'], axis=1, inplace=True)
#X = X[X['time_to_hearing'] > 0]
#X = X[(np.abs(stats.zscore(X['time_to_hearing'])) < 3)] # drop outliers
    
# Do a final dropna
X.dropna(inplace=True)

# Get target values, then drop col from features
y = X['compliance']
X.drop('compliance', axis=1, inplace=True)

# Remove ticket id as this likely is not informative for future cases
X.drop('ticket_id', axis=1, inplace=True)
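
One thing to watch with the one-hot encoding loop above: `pd.get_dummies(X[col])` names each new column after the raw category value, so two categorical columns that share a value would produce two columns with the same name. The notebook does not check whether that happens in this data set, so the sketch below only illustrates the mechanism on a made-up frame and shows the `columns=` form of `get_dummies`, which prefixes each dummy with its source column. Adopting it would also rename entries such as `MI` in the feature importance output.

```python
import pandas as pd

# Toy frame: two categoricals share the value 'Other' (made-up data).
toy = pd.DataFrame({
    'state': ['MI', 'OH', 'Other'],
    'disposition': ['Responsible by Default', 'Other', 'Responsible by Default'],
    'judgment_amount': [305.0, 140.0, 85.0],
})

# Unprefixed encoding, mirroring the loop above: both categoricals emit a column named 'Other'.
unprefixed = toy[['judgment_amount']]
for col in ['state', 'disposition']:
    unprefixed = pd.concat([unprefixed, pd.get_dummies(toy[col])], axis=1)
print('Duplicate column names:', unprefixed.columns.duplicated().any())

# Prefixed encoding: each dummy column is named '<source column>_<value>'.
prefixed = pd.get_dummies(toy, columns=['state', 'disposition'])
print(list(prefixed.columns))
```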

In [75]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the training data. Won't do this for actual function, but is useful here for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y)
rfc = RandomForestClassifier(max_features = 8, n_estimators = 10).fit(X_train, y_train)

# Actually looks surprisingly good!!
print('Training accuracy: {}'.format(rfc.score(X_train, y_train)))
print('Test accuracy: {}'.format(rfc.score(X_test, y_test)))

Training accuracy: 0.9873569736798212
Test accuracy: 0.9317988491368526


In [76]:
from sklearn.metrics import classification_report, roc_curve, auc

print(classification_report(y_test, rfc.predict(X_test), target_names=['non-compliant', 'compliant']))

fpr, tpr, _ = roc_curve(np.asarray(y_test), rfc.predict_proba(X_test)[:,1])
print('AUC: {}'.format(auc(fpr, tpr)))

               precision    recall  f1-score   support

non-compliant       0.95      0.98      0.96     37056
    compliant       0.57      0.27      0.37      2914

    micro avg       0.93      0.93      0.93     39970
    macro avg       0.76      0.63      0.67     39970
 weighted avg       0.92      0.93      0.92     39970

AUC: 0.7415297604071465
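
Because compliant tickets make up only a small share of the test split, raw accuracy is a weak yardstick here. As a quick worked check, using only the support counts printed in the classification report above, a trivial model that always predicts non-compliant would already score about 0.927. The random forest's test accuracy is only slightly higher than that, which is one reason the AUC is the more informative number. The variable names below are illustrative.

```python
# Support counts copied from the classification report above.
n_non_compliant = 37056
n_compliant = 2914

# Accuracy of a trivial model that always predicts 'non-compliant'.
baseline_accuracy = n_non_compliant / (n_non_compliant + n_compliant)
print('Majority-class baseline accuracy: {:.4f}'.format(baseline_accuracy))

# Test accuracy printed by the notebook, for comparison.
reported_test_accuracy = 0.9317988491368526
gain = reported_test_accuracy - baseline_accuracy
print('Gain over baseline: {:.4f}'.format(gain))
```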


In [77]:
feature_importances = pd.DataFrame(rfc.feature_importances_, index = X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances.iloc[0:10, :]

| feature | importance |
|---|---|
| lon | 0.306876 |
| lat | 0.305641 |
| judgment_amount | 0.083803 |
| discount_amount | 0.060611 |
| Responsible by Default | 0.043075 |
| Responsible by Determination | 0.03208 |
| Responsible by Admission | 0.025173 |
| Responsible (Fine Waived) by Deter | 0.005514 |
| MI | 0.003961 |
| 9-1-36(a) | 0.003501 |
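
The ranking above comes from the forest's impurity-based `feature_importances_`, which is known to favor continuous features with many distinct values, such as `lat` and `lon`, over binary dummy columns. A possible cross-check is permutation importance on held-out data. The sketch below uses synthetic data from `make_classification` as a stand-in, so it shows only the mechanics and says nothing about the blight data; all `demo`-style names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y; the notebook's real data is not used here.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

demo_rfc = RandomForestClassifier(n_estimators=10, random_state=0).fit(Xd_train, yd_train)

# Shuffle each feature in the held-out split and measure the resulting drop in AUC.
perm = permutation_importance(demo_rfc, Xd_test, yd_test, scoring='roc_auc',
                              n_repeats=5, random_state=0)

# Print features from most to least important by mean AUC drop.
for idx in perm.importances_mean.argsort()[::-1]:
    print('feature {}: {:.4f} +/- {:.4f}'.format(
        idx, perm.importances_mean[idx], perm.importances_std[idx]))
```

If `lat` and `lon` still ranked first under this measure on the real data, that would lend more weight to the geographic interpretation given earlier.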
