The Michigan Data Science Team (MDST) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit: blight. Blight violations are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the City of Detroit issues millions of dollars in fines to residents, and every year many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

In [45]:
'''
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

train_data = pd.read_csv('readonly/train.csv', engine = 'python')
test_data = pd.read_csv('readonly/test.csv', engine = 'python')
address = pd.read_csv('readonly/addresses.csv', engine = 'python')
latlon = pd.read_csv('readonly/latlons.csv', engine = 'python')


drop_list = ['violator_name', 'zip_code', 'city',
            'inspector_name', 'violation_street_number', 'violation_street_name',
            'violation_zip_code', 'violation_description',
            'mailing_address_str_number', 'mailing_address_str_name',
            'non_us_str_code', 'state',
            'ticket_issued_date', 'hearing_date', 'grafitti_status']
train_data.drop(drop_list, axis=1, inplace=True)
test_data.drop(drop_list, axis=1, inplace=True)
train_data = train_data[['ticket_id', 'agency_name', 'country', 'violation_code', 'disposition',
        'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
        'clean_up_cost', 'judgment_amount','compliance' ]]

# Attach coordinates: address -> (lat, lon), then ticket_id -> (lat, lon).
address = address.set_index('address').join(latlon.set_index('address'), how='left')
train_data = train_data.set_index('ticket_id').join(address.set_index('ticket_id'))
test_data = test_data.set_index('ticket_id').join(address.set_index('ticket_id'))

# Drop rows whose compliance label is missing (NaN).
train_data = train_data[np.isfinite(train_data['compliance'])].copy()

# Fill missing coordinates with the column mean. Assigning the result back avoids
# the chained inplace fillna, which may operate on a temporary copy.
train_data['lat'] = train_data['lat'].fillna(train_data['lat'].mean())
train_data['lon'] = train_data['lon'].fillna(train_data['lon'].mean())
test_data['lat'] = test_data['lat'].fillna(test_data['lat'].mean())
test_data['lon'] = test_data['lon'].fillna(test_data['lon'].mean())

test_data['country'].unique()
'''

array(['USA'], dtype=object)
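
The two-step join in the cell above (address to coordinates, then ticket to coordinates) can be checked on a toy example. This is only a sketch: the three small frames and their values are hypothetical stand-ins for addresses.csv, latlons.csv and the ticket data, and the names `addresses`, `latlons`, `tickets`, `geo` and `tickets_geo` are not from the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the three inputs.
addresses = pd.DataFrame({
    'ticket_id': [1, 2, 3],
    'address': ['10 main st', '22 oak ave', '10 main st'],
})
latlons = pd.DataFrame({
    'address': ['10 main st', '22 oak ave'],
    'lat': [42.33, 42.40],
    'lon': [-83.05, -83.10],
})
tickets = pd.DataFrame({
    'ticket_id': [1, 2, 3],
    'fine_amount': [250.0, 100.0, 50.0],
})

# Step 1: look up coordinates for every address row.
geo = addresses.set_index('address').join(latlons.set_index('address'), how='left')

# Step 2: re-index by ticket_id and attach the coordinates to the tickets.
tickets_geo = tickets.set_index('ticket_id').join(geo.set_index('ticket_id'))
print(tickets_geo)
```

Tickets 1 and 3 share an address, so both receive the same coordinates; an address missing from the coordinate table would leave NaN in `lat` and `lon`, which is what the mean imputation in the cell handles.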

In [46]:
# Encode the categorical columns with LabelEncoder rather than hand-written mappings.
'''
# Fit each encoder on the train and test values together so that labels which
# appear only in the test set are still known at transform time.
# Series.append was removed in pandas 2.0, so pd.concat is used instead.
for col in ['disposition', 'violation_code', 'country', 'agency_name']:
    le.fit(pd.concat([train_data[col], test_data[col]], ignore_index=True))
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])
'''
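
A small illustration of why the encoder is fitted on train and test values together. The category names below are made up for the example, and `train_col`, `test_col` and the other variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values; 'waived' occurs only in the test column.
train_col = pd.Series(['default', 'admission'])
test_col = pd.Series(['default', 'waived'])

# Fitting on the training column alone fails on the unseen test label.
train_only = LabelEncoder().fit(train_col)
try:
    train_only.transform(test_col)
    unseen_label_failed = False
except ValueError:
    unseen_label_failed = True

# Fitting on the combined values gives every label an integer code.
le = LabelEncoder()
le.fit(pd.concat([train_col, test_col], ignore_index=True))
train_codes = le.transform(train_col)
test_codes = le.transform(test_col)
print(list(le.classes_), list(train_codes), list(test_codes))
```

The integer codes follow the sorted order of the labels, so they carry no meaning beyond identity. Tree-based models tolerate this reasonably well, but it is a design choice rather than a requirement.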

In [47]:
'''
############################ Baseline Score #################################

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, roc_auc_score

# Features are every column except the target.
X = train_data.drop(['compliance'], axis = 1).values
y = train_data.loc[:,'compliance'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nbclf = nb.fit(X_train, y_train)
# predict() returns hard 0/1 labels, so the AUC below is computed from a single
# threshold; nbclf.predict_proba(X_test)[:, 1] would give a ranking-based AUC.
y_pred = nbclf.predict(X_test)
score = roc_auc_score(y_test, y_pred)
print('score is: ', score)
print('Confusion Matrix is: ', confusion_matrix(y_test, y_pred))
'''

score is:  0.550241362514
Confusion Matrix is:  [[37060    18]
 [ 2600   292]]
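
Because the baseline scores hard 0/1 predictions, its ROC curve has a single interior point, and the area under it reduces to (1 + TPR - FPR) / 2. The short check below recomputes the score from the confusion matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Counts copied from the confusion matrix output above.
tn, fp = 37060, 18   # true class 0: predicted 0, predicted 1
fn, tp = 2600, 292   # true class 1: predicted 0, predicted 1

tpr = tp / (tp + fn)  # share of compliant tickets that were found
fpr = fp / (fp + tn)  # share of non-compliant tickets flagged by mistake

# Area under the ROC polyline (0,0) -> (fpr, tpr) -> (1,1).
auc_from_labels = 0.5 * (1 + tpr - fpr)
print(round(auc_from_labels, 6))  # 0.550241
```

The result matches the reported score, which shows that the baseline finds only about 10% of the compliant tickets.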


In [48]:
############################ Main Regression Code ############################
'''
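# The regressor is fit on the 0/1 compliance target, so its predictions are
# continuous scores that the roc_auc scorer can rank.
clf = RandomForestRegressor()
# Small grid over forest size and tree depth, scored by cross-validated AUC.
grid_values = {'n_estimators': [10, 100], 'max_depth': [None, 30]}
clf_clf = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc')
clf_clf.fit(X_train, y_train)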

print('Grid best parameter (max. AUC): ', clf_clf.best_params_)
print('Grid best score (AUC): ', clf_clf.best_score_)
'''

Grid best parameter (max. AUC):  {'max_depth': 30, 'n_estimators': 100}
Grid best score (AUC):  0.790284023186
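
The cell above uses a regressor on a binary target. One alternative, which is not what this notebook does, is a RandomForestClassifier whose `predict_proba` returns class probabilities directly. The sketch below runs on synthetic data from `make_classification`, not on the blight data, and every name in it is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the ticket data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_tr, y_tr)

# Column 1 holds the estimated probability of the positive class.
proba = rf.predict_proba(X_te)[:, 1]
print('AUC on synthetic data:', roc_auc_score(y_te, proba))
```

Either model produces a score in [0, 1] that AUC can rank, so the choice mainly affects how the outputs are interpreted.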


In [49]:
def blight_model():
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()

    #train_data = pd.read_csv('readonly/train.csv', engine = 'python')
    train_data = pd.read_csv('train.csv', engine = 'python')
    #test_data = pd.read_csv('readonly/test.csv', engine = 'python')
    test_data = pd.read_csv('test.csv', engine = 'python')
    #address = pd.read_csv('readonly/addresses.csv', engine = 'python')
    address = pd.read_csv('addresses.csv', engine = 'python')
    #latlon = pd.read_csv('readonly/latlons.csv', engine = 'python')
    latlon = pd.read_csv('latlons.csv', engine = 'python')


    drop_list = ['violator_name', 'zip_code', 'city',
                'inspector_name', 'violation_street_number', 'violation_street_name',
                'violation_zip_code', 'violation_description',
                'mailing_address_str_number', 'mailing_address_str_name',
                'non_us_str_code', 'state',
                'ticket_issued_date', 'hearing_date', 'grafitti_status']
    train_data.drop(drop_list, axis=1, inplace=True)
    test_data.drop(drop_list, axis=1, inplace=True)
    train_data = train_data[['ticket_id', 'agency_name', 'country', 'violation_code', 'disposition',
            'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
            'clean_up_cost', 'judgment_amount','compliance' ]]

    address = address.set_index('address').join(latlon.set_index('address'), how='left')
    train_data = train_data.set_index('ticket_id').join(address.set_index('ticket_id'))
    test_data = test_data.set_index('ticket_id').join(address.set_index('ticket_id'))

    train_data = train_data[np.isfinite(train_data['compliance'])]

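    # Fill missing coordinates with the column mean. Assigning the result back
    # avoids the chained inplace fillna, which may operate on a temporary copy.
    train_data['lat'] = train_data['lat'].fillna(train_data['lat'].mean())
    train_data['lon'] = train_data['lon'].fillna(train_data['lon'].mean())
    test_data['lat'] = test_data['lat'].fillna(test_data['lat'].mean())
    test_data['lon'] = test_data['lon'].fillna(test_data['lon'].mean())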

    # Fit each encoder on train and test values together so that test-only labels
    # are known at transform time. Series.append was removed in pandas 2.0,
    # so pd.concat is used instead.
    for col in ['disposition', 'violation_code', 'country', 'agency_name']:
        le.fit(pd.concat([train_data[col], test_data[col]], ignore_index=True))
        train_data[col] = le.transform(train_data[col])
        test_data[col] = le.transform(test_data[col])

    X = train_data.drop(['compliance'], axis = 1).values
    y = train_data.loc[:,'compliance'].values
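    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The regressor is fit on the 0/1 compliance target, so its predictions are
    # continuous scores; the grid is scored by cross-validated AUC.
    clf = RandomForestRegressor()
    grid_values = {'n_estimators': [10, 100], 'max_depth': [None, 30]}
    clf_clf = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc')
    clf_clf.fit(X_train, y_train)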
    
    print('Grid best parameter (max. AUC): ', clf_clf.best_params_)
    print('Grid best score (AUC): ', clf_clf.best_score_)

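    # Predicted compliance scores for the test tickets, indexed by ticket_id.
    # The model was fit on plain arrays, so the test frame is passed as .values too.
    return pd.DataFrame(clf_clf.predict(test_data.values), index=test_data.index)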

In [50]:
blight_model()

Grid best parameter (max. AUC):  {'max_depth': 30, 'n_estimators': 100}
Grid best score (AUC):  0.790750078409


                  0
ticket_id
284932     0.000804
285362     0.000210
285361     0.040800
285338     0.020920
285346     0.060168
285345     0.030920
285347     0.128762
285342     0.990000
285530     0.000000
284989     0.030287
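
The function returns a one-column DataFrame. If a one-dimensional result is more convenient downstream, the column can be pulled out as a Series. The sketch below rebuilds the first three rows of the output above by hand; the names `preds` and `compliance_prob` and the label 'compliance' are illustrative choices.

```python
import pandas as pd

# First three rows of the output above, rebuilt by hand for illustration.
preds = pd.DataFrame(
    [0.000804, 0.000210, 0.040800],
    index=pd.Index([284932, 285362, 285361], name='ticket_id'),
)

# Select the single column (labelled 0) and give the Series a readable name.
compliance_prob = preds[0].rename('compliance')
print(compliance_prob)
```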
