---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

## Assignment 4 - Understanding and Predicting Property Maintenance Fines

This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:

* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)
* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)
* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)
* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)
* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)

___

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, is only included in train.csv.

Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.

<br>

**File descriptions** (Use only this data for training your model!)

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). 

Your grade will be based on the AUC score computed for your classifier. A model with an AUROC of 0.7 passes this assignment; a model with an AUROC over 0.75 will receive full points.
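
As a small illustration of the metric, the sketch below computes AUC on six made-up tickets. The labels and probabilities are hypothetical, not taken from the assignment data. With no tied scores, AUC equals the share of (compliant, non-compliant) pairs in which the compliant ticket gets the higher predicted probability.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for six tickets
y_true = [0, 0, 1, 0, 1, 0]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]

auc_value = roc_auc_score(y_true, y_prob)

# Same number computed by hand: fraction of correctly ordered pairs
pos = [p for p, t in zip(y_prob, y_true) if t == 1]
neg = [p for p, t in zip(y_prob, y_true) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(auc_value, pairwise)  # both are 0.875 (7 of 8 pairs ordered correctly)
```
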
___

For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `test.csv` will be paid, and the index being the ticket_id.

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32
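
A minimal sketch of a Series in the shape shown above, assuming the predictions are already available as an array. The three ids and probabilities are copied from the example and stand in for real model output.

```python
import numpy as np
import pandas as pd

# Placeholder ids and probabilities standing in for real predictions
ticket_ids = pd.Index([284932, 285362, 285361], name='ticket_id')
probs = np.array([0.531842, 0.401958, 0.105928])

# Name the Series 'compliance' and cast to float32 to match the example
result = pd.Series(probs, index=ticket_ids, name='compliance').astype('float32')
print(result)
```
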

## Imports and Notes

In [355]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression   # C param
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier  # alpha param, activation funcs
# nnclf = MLPClassifier(hidden_layer_sizes=[10, 10], solver='lbfgs', random_state=0).fit(X_train, y_train)


# clf.feature_importances_
# plot_feature_importances(clf, cancer.feature_names)


from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)


from sklearn.model_selection import cross_val_score
# clf = KNeighborsClassifier(n_neighbors=5)
# X = X_fruits_2d.to_numpy()  # as_matrix() was removed in pandas 1.0
# y = y_fruits_2d.to_numpy()
# cv_scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')

# must scale stuff within each fold
# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

from sklearn.model_selection import validation_curve
# creates own folds to test parameters

# pd.get_dummies makes categoricals into numerics
# mush_df2 = pd.get_dummies(mush_df)

from sklearn.dummy import DummyClassifier  # try different strategies
# dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
# y_dummy_predictions = dummy_majority.predict(X_test)
# y_dummy_predictions
# dummy_majority.score(X_test, y_test)

from sklearn.metrics import confusion_matrix
# confusion = confusion_matrix(y_test, y_majority_predicted)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)  Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall) 

from sklearn.metrics import classification_report
# print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))

from sklearn.metrics import roc_curve, auc, roc_auc_score
# y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
# fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
# roc_auc_lr = auc(fpr_lr, tpr_lr)

from sklearn.model_selection import GridSearchCV
# grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
# grid_values = {'penalty':['l1', 'l2'], 'C':[0.01, 0.1, 1, 10, 100]}

# grid_clf_auc = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc')
# grid_clf_auc.fit(X_train, y_train)
# y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test)

# print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
# print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
# print('Grid best score (AUC): ', grid_clf_auc.best_score_)
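
The notes above say scaling must happen within each fold and that grid search can optimize for AUC. A rough sketch of combining the two is below. It runs on synthetic data from `make_classification` rather than the ticket data, and the step names `scale` and `clf` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the ticket features (not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)

# The scaler is a pipeline step, so it is refit on the training part of each fold
pipe = Pipeline([('scale', MinMaxScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, param_grid={'clf__C': [0.01, 0.1, 1, 10, 100]},
                    scoring='roc_auc', cv=5)
grid.fit(X_demo, y_demo)

print('Best C:', grid.best_params_['clf__C'])
print('Best CV AUC:', grid.best_score_)
```

Because the scaler lives inside the pipeline, the held-out part of each fold never influences the min and max used for scaling.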

## Load Data and Explore

In [255]:
train_df = pd.read_csv('train.csv', encoding='ISO-8859-1', index_col='ticket_id')
test_df = pd.read_csv('test.csv', encoding='ISO-8859-1', index_col='ticket_id')
addresses = pd.read_csv('addresses.csv', index_col='ticket_id')
latlons = pd.read_csv('latlons.csv')

drop_fields = [ # Leaky fields
                'payment_amount', 
                'payment_date', 
                'payment_status', 
                'balance_due', 
                'collection_status', 
                'compliance_detail', 
                # Several posts on the forums say that late_fee is necessary to use to get over 0.8 AUC, despite being a leakage
#                 'late_fee',
               
                # Other fields that have no information gain
                'violator_name',
                'violation_zip_code',
                'grafitti_status', 
                'violation_description',
                'violation_code',  # maybe
                'non_us_str_code',
                'mailing_address_str_number',
                'mailing_address_str_name',
                'city',
                'state',
#                 'zip_code',
                'country']                        

train_df.drop(drop_fields, axis=1, inplace=True)
train_df.dropna(how='all', subset=['compliance'], inplace=True)

train_df = pd.merge(left=train_df, right=addresses, how='left', left_index=True, right_index=True)

train_df.head()



| ticket_id | agency_name | inspector_name | violation_street_number | violation_street_name | zip_code | ticket_issued_date | hearing_date | disposition | fine_amount | admin_fee | state_fee | late_fee | discount_amount | clean_up_cost | judgment_amount | compliance | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22056 | Buildings, Safety Engineering & Env Department | Sims, Martinzie | 2900.0 | TYLER | 60606 | 2004-03-16 11:40:00 | 2005-03-21 10:30:00 | Responsible by Default | 250.0 | 20.0 | 10.0 | 25.0 | 0.0 | 0.0 | 305.0 | 0.0 | 2900 tyler, Detroit MI |
| 27586 | Buildings, Safety Engineering & Env Department | Williams, Darrin | 4311.0 | CENTRAL | 48208 | 2004-04-23 12:30:00 | 2005-05-06 13:30:00 | Responsible by Determination | 750.0 | 20.0 | 10.0 | 75.0 | 0.0 | 0.0 | 855.0 | 1.0 | 4311 central, Detroit MI |
| 22046 | Buildings, Safety Engineering & Env Department | Sims, Martinzie | 6478.0 | NORTHFIELD | 908041512 | 2004-05-01 11:50:00 | 2005-03-21 10:30:00 | Responsible by Default | 250.0 | 20.0 | 10.0 | 25.0 | 0.0 | 0.0 | 305.0 | 0.0 | 6478 northfield, Detroit MI |
| 18738 | Buildings, Safety Engineering & Env Department | Williams, Darrin | 8027.0 | BRENTWOOD | 48038 | 2004-06-14 14:15:00 | 2005-02-22 15:00:00 | Responsible by Default | 750.0 | 20.0 | 10.0 | 75.0 | 0.0 | 0.0 | 855.0 | 0.0 | 8027 brentwood, Detroit MI |
| 18735 | Buildings, Safety Engineering & Env Department | Williams, Darrin | 8228.0 | MT ELLIOTT | 48211 | 2004-06-16 12:30:00 | 2005-02-22 15:00:00 | Responsible by Default | 100.0 | 20.0 | 10.0 | 10.0 | 0.0 | 0.0 | 140.0 | 0.0 | 8228 mt elliott, Detroit MI |


In [256]:
train_df.shape

(159880, 17)

In [257]:
train_df.describe()

| | violation_street_number | fine_amount | admin_fee | state_fee | late_fee | discount_amount | clean_up_cost | judgment_amount | compliance |
|---|---|---|---|---|---|---|---|---|---|
| count | 159880.0 | 159880.0 | 159880.0 | 159880.0 | 159880.0 | 159880.0 | 159880.0 | 159880.0 | 159880.0 |
| mean | 10713.16 | 357.035295 | 20.0 | 10.0 | 33.651512 | 0.195959 | 0.0 | 420.650218 | 0.072536 |
| std | 36231.59 | 675.65558 | 0.0 | 0.0 | 67.692916 | 4.290344 | 0.0 | 742.555062 | 0.259374 |
| min | 0.0 | 0.0 | 20.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4920.0 | 200.0 | 20.0 | 10.0 | 10.0 | 0.0 | 0.0 | 250.0 | 0.0 |
| 50% | 10398.0 | 250.0 | 20.0 | 10.0 | 25.0 | 0.0 | 0.0 | 305.0 | 0.0 |
| 75% | 15783.25 | 250.0 | 20.0 | 10.0 | 25.0 | 0.0 | 0.0 | 305.0 | 0.0 |
| max | 14154110.0 | 10000.0 | 20.0 | 10.0 | 1000.0 | 350.0 | 0.0 | 11030.0 | 1.0 |


### Null Counts
- Zero violation zip codes
- Missing 227 hearing dates

In [258]:
train_df.isnull().sum()

agency_name                  0
inspector_name               0
violation_street_number      0
violation_street_name        0
zip_code                     1
ticket_issued_date           0
hearing_date               227
disposition                  0
fine_amount                  0
admin_fee                    0
state_fee                    0
late_fee                     0
discount_amount              0
clean_up_cost                0
judgment_amount              0
compliance                   0
address                      0
dtype: int64

### Fees, Extra costs, Discounts
- Tickets with a discount have a much higher compliance rate (about 95%)
- Tickets with a late fee have a much lower compliance rate (about 4%)
- No rows have both a late fee and a discount

In [260]:
cols = ['fine_amount', 
        'admin_fee', 
        'state_fee', 
        'late_fee', 
        'discount_amount', 
        'clean_up_cost', 
        'judgment_amount', 
        'compliance']

# Sub in any feature to look at df results
train_df[cols][train_df['late_fee'] > 0].sort_values('judgment_amount', ascending=False).head()

| ticket_id | fine_amount | admin_fee | state_fee | late_fee | discount_amount | clean_up_cost | judgment_amount | compliance |
|---|---|---|---|---|---|---|---|---|
| 63477 | 10000.0 | 20.0 | 10.0 | 1000.0 | 0.0 | 0.0 | 11030.0 | 0.0 |
| 200507 | 10000.0 | 20.0 | 10.0 | 1000.0 | 0.0 | 0.0 | 11030.0 | 0.0 |
| 221177 | 10000.0 | 20.0 | 10.0 | 1000.0 | 0.0 | 0.0 | 11030.0 | 0.0 |
| 218313 | 10000.0 | 20.0 | 10.0 | 1000.0 | 0.0 | 0.0 | 11030.0 | 0.0 |
| 92968 | 10000.0 | 20.0 | 10.0 | 1000.0 | 0.0 | 0.0 | 11030.0 | 0.0 |


#### When there was a discount, what percentage of people were compliant?

In [438]:
result_1 = train_df[(train_df['discount_amount'] > 0)]['compliance'].mean()
result_1

0.9516949152542373

#### When there was a late fee, what percentage of people were compliant?

In [439]:
result_2 = train_df[(train_df['late_fee'] > 0)]['compliance'].mean()
result_2

0.04151029621525806

#### When there was a discount AND a late fee, what percentage of people were compliant?

In [441]:
result_3 = train_df[(train_df['discount_amount'] > 0) & (train_df['late_fee'] > 0)]['compliance'].mean()
result_3

nan

In [264]:
train_df.clean_up_cost.value_counts()

0.0    159880
Name: clean_up_cost, dtype: int64

#### clean_up_cost is all 0

In [265]:
train_df.drop('clean_up_cost', axis=1, inplace=True)

### Dates

#### Explore

In [266]:
cols = ['ticket_issued_date', 'hearing_date', 'compliance']
train_df[cols].sort_values('ticket_issued_date', ascending=False).tail()

| ticket_id | ticket_issued_date | hearing_date | compliance |
|---|---|---|---|
| 18738 | 2004-06-14 14:15:00 | 2005-02-22 15:00:00 | 0.0 |
| 22046 | 2004-05-01 11:50:00 | 2005-03-21 10:30:00 | 0.0 |
| 27586 | 2004-04-23 12:30:00 | 2005-05-06 13:30:00 | 1.0 |
| 22056 | 2004-03-16 11:40:00 | 2005-03-21 10:30:00 | 0.0 |
| 226673 | 1988-05-06 20:00:00 | 2010-01-25 13:30:00 | 1.0 |


In [267]:
train_df.ticket_issued_date.dtype
train_df.hearing_date.dtype

dtype('O')

#### Cast features to date_time

In [268]:
# infer_datetime_format is deprecated since pandas 2.0; format inference is now the default
train_df['ticket_issued_date'] = pd.to_datetime(train_df['ticket_issued_date'])
train_df['hearing_date'] = pd.to_datetime(train_df['hearing_date'])
train_df['days_to_hearing'] = train_df['hearing_date'] - train_df['ticket_issued_date']

In [269]:
train_df.days_to_hearing.dtype

dtype('<m8[ns]')

#### There are some errors in the dates where the hearing date is before the issue date
- Either the ticket_issued_date and the hearing_date were switched (in which case taking the absolute value will fix it)
- or the hearing_date year should be one year later
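
The two candidate fixes can be sketched on a couple of made-up rows. The first row copies the issue and hearing timestamps of ticket 239257 from the output below; the column names `days_abs` and `days_plus_year` are hypothetical. Which fix is right for the real data is not settled here.

```python
import pandas as pd

# Two demo rows: one with the hearing dated before the issue date, one normal
demo = pd.DataFrame({
    'ticket_issued_date': pd.to_datetime(['2010-12-23 12:00', '2004-03-16 11:40']),
    'hearing_date': pd.to_datetime(['2010-01-21 09:00', '2005-03-21 10:30']),
})

gap = demo['hearing_date'] - demo['ticket_issued_date']
negative = gap < pd.Timedelta(0)

# Option 1: assume the two dates were swapped, so take the absolute gap
demo['days_abs'] = gap.abs().dt.days

# Option 2: assume the hearing year is off by one, so shift only the bad rows
fixed_hearing = demo['hearing_date'].where(
    ~negative, demo['hearing_date'] + pd.DateOffset(years=1))
demo['days_plus_year'] = (fixed_hearing - demo['ticket_issued_date']).dt.days

print(demo[['days_abs', 'days_plus_year']])
```

The options disagree sharply on the first row (336 days versus 28 days), so the choice matters if `days_to_hearing` is used as a feature.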

In [270]:
cols = ['ticket_issued_date', 
        'hearing_date', 
        'days_to_hearing', 
        'compliance']

train_df[cols].sort_values('days_to_hearing', ascending=True).head(100)

| ticket_id | ticket_issued_date | hearing_date | days_to_hearing | compliance |
|---|---|---|---|---|
| 239257 | 2010-12-23 12:00:00 | 2010-01-21 09:00:00 | -337 days +21:00:00 | 0.0 |
| 239255 | 2010-12-23 12:00:00 | 2010-01-21 15:00:00 | -336 days +03:00:00 | 1.0 |
| 239245 | 2010-12-21 11:00:00 | 2010-01-25 13:30:00 | -330 days +02:30:00 | 0.0 |
| 239330 | 2010-12-22 11:00:00 | 2010-01-29 09:00:00 | -328 days +22:00:00 | 1.0 |
| 155345 | 2008-12-31 13:30:00 | 2008-02-25 13:30:00 | -310 days +00:00:00 | 1.0 |
| 239345 | 2010-12-10 10:50:00 | 2010-02-10 09:00:00 | -304 days +22:10:00 | 0.0 |
| 267968 | 2011-12-22 09:00:00 | 2011-03-01 15:00:00 | -296 days +06:00:00 | 0.0 |
| 267966 | 2011-12-22 09:00:00 | 2011-03-02 15:00:00 | -295 days +06:00:00 | 0.0 |
| 267970 | 2011-12-22 09:00:00 | 2011-03-03 15:00:00 | -294 days +06:00:00 | 0.0 |
| 267054 | 2011-12-16 22:15:00 | 2011-03-10 09:00:00 | -282 days +10:45:00 | 0.0 |


### Geographic Location

#### In each mailing zip code (`zip_code`), what percentage of people were compliant?

In [272]:
zip_info = train_df.groupby('zip_code')['compliance'].agg(['size', 'sum', 'mean'])
zip_info.rename(columns={'size': 'count', 
                         'sum': 'compliant', 
                         'mean': 'percent_compliant'}, inplace=True)

zip_info['non_compliant'] = zip_info['count'] - zip_info['compliant']
zip_info['percent_non_compliant'] = 1 - zip_info['percent_compliant']

zip_info.sort_values('non_compliant', ascending=False).head(10)

| zip_code | count | compliant | percent_compliant | non_compliant | percent_non_compliant |
|---|---|---|---|---|---|
| 48227 | 4467 | 206.0 | 0.046116 | 4261.0 | 0.953884 |
| 48221 | 4524 | 269.0 | 0.059461 | 4255.0 | 0.940539 |
| 48235 | 4299 | 248.0 | 0.057688 | 4051.0 | 0.942312 |
| 48219 | 3940 | 228.0 | 0.057868 | 3712.0 | 0.942132 |
| 48228 | 3681 | 214.0 | 0.058136 | 3467.0 | 0.941864 |
| 48224 | 3559 | 169.0 | 0.047485 | 3390.0 | 0.952515 |
| 48238 | 3475 | 200.0 | 0.057554 | 3275.0 | 0.942446 |
| 48205 | 2948 | 140.0 | 0.04749 | 2808.0 | 0.95251 |
| 48204 | 2807 | 194.0 | 0.069113 | 2613.0 | 0.930887 |
| 48227 | 2849 | 259.0 | 0.090909 | 2590.0 | 0.909091 |
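
Zip code 48227 appears twice in the output above. One plausible cause, not confirmed in this notebook, is that `zip_code` holds a mix of integers and strings, so equal zip codes fall into separate groups. A rough sketch of normalizing to five-character strings before grouping is below; the demo values and the helper `to_zip5` are hypothetical.

```python
import pandas as pd

# Hypothetical zip_code column mixing ints, strings, a ZIP+4 value, and a gap
demo = pd.DataFrame({
    'zip_code': [48227, '48227', '48221', 482271234, None],
    'compliance': [0.0, 1.0, 0.0, 0.0, 1.0],
})

def to_zip5(value):
    """Return the first five characters of a zip code as a string."""
    if pd.isna(value):
        return 'unknown'
    return str(value).strip()[:5]

demo['zip5'] = demo['zip_code'].map(to_zip5)
print(demo.groupby('zip5')['compliance'].agg(['size', 'mean']))
```
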


#### Addresses 

In [273]:
addresses.sort_values('address', ascending=False).head()

| ticket_id | address |
|---|---|
| 361924 | 9999 longacre, Detroit MI 48227 |
| 221077 | 9999 gratiot, Detroit MI |
| 64492 | 9999 gratiot, Detroit MI |
| 343392 | 9999 cheyenne, Detroit MI 48227 |
| 343558 | 9999 cheyenne, Detroit MI 48227 |


#### Latitudes and Longitudes

In [274]:
latlons.sort_values('address', ascending=False).head()

| | address | lat | lon |
|---|---|---|---|
| 41551 | 9999 longacre, Detroit MI 48227 | 42.369004 | -83.214585 |
| 102721 | 9999 gratiot, Detroit MI | 42.394179 | -83.004524 |
| 111570 | 9999 cheyenne, Detroit MI 48227 | 42.369795 | -83.174467 |
| 41786 | 9999 asbury park, Detroit MI | 42.369134 | -83.207346 |
| 47593 | 9999 abington ave, Detroit MI | 42.367799 | -83.21066 |


In [275]:
merge = pd.merge(train_df, latlons, on='address', how='left')
merge.isnull().sum()

agency_name                  0
inspector_name               0
violation_street_number      0
violation_street_name        0
zip_code                     1
ticket_issued_date           0
hearing_date               227
disposition                  0
fine_amount                  0
admin_fee                    0
state_fee                    0
late_fee                     0
discount_amount              0
judgment_amount              0
compliance                   0
address                      0
days_to_hearing            227
lat                          2
lon                          2
dtype: int64

In [278]:
cols = ['violation_street_number',            
        'violation_street_name',              
#         'violation_zip_code',            
#         'mailing_address_str_number',  
#         'mailing_address_str_name',         
#         'city',                               
#         'state',                             
#         'zip_code',                           
#         'country',
        'address',
        'lat',
        'lon']

merge[cols].sort_values('address', ascending=True).head()

| | violation_street_number | violation_street_name | address | lat | lon |
|---|---|---|---|---|---|
| 61173 | 0.0 | AARON ST | 0 aaron st, Detroit MI | 42.366959 | -83.031095 |
| 135769 | 0.0 | CHIPPEWA | 0 chippewa, Detroit MI | 42.438542 | -83.272719 |
| 136775 | 0.0 | COVINGTON | 0 covington, Detroit MI | 42.420327 | -83.112464 |
| 125968 | 0.0 | COVINGTON | 0 covington, Detroit MI | 42.420327 | -83.112464 |
| 125935 | 0.0 | COVINGTON | 0 covington, Detroit MI | 42.420327 | -83.112464 |


### Categoricals

#### Agency Name

In [279]:
# agents, inspectors
agencies = train_df.groupby('agency_name')['compliance'].agg(['size', 'sum', 'mean'])
agencies.rename(columns={'size': 'count', 
                     'sum': 'compliant', 
                     'mean': 'percent_compliant'}, inplace=True)

agencies
# ucats_an = set(train['agency_name'])|{'<unknown>'}

# train['agency_name']= pd.Categorical(train['agency_name'],categories=ucats_an).fillna('<unknown>').codes

# test['agency_name']= pd.Categorical(test['agency_name'],categories=ucats_an).fillna('<unknown>').codes

| agency_name | count | compliant | percent_compliant |
|---|---|---|---|
| Buildings, Safety Engineering & Env Department | 95863 | 5823.0 | 0.060743 |
| Department of Public Works | 52445 | 4718.0 | 0.089961 |
| Detroit Police Department | 4464 | 588.0 | 0.13172 |
| Health Department | 7107 | 468.0 | 0.065851 |
| Neighborhood City Halls | 1 | 0.0 | 0.0 |
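
The commented-out lines in the cell above hint at encoding `agency_name` as integer codes with a shared category list and an `'<unknown>'` fallback. A minimal sketch of that idea on made-up frames is below. The helper `encode` and the agency `'Other Agency'` are hypothetical, and here the category list is built from the training names only.

```python
import pandas as pd

# Hypothetical agency names; 'Other Agency' appears only in the test frame
train_demo = pd.DataFrame({'agency_name': ['Department of Public Works',
                                           'Health Department',
                                           'Department of Public Works']})
test_demo = pd.DataFrame({'agency_name': ['Health Department', 'Other Agency']})

# Category list from training data only, plus an explicit fallback label
categories = sorted(set(train_demo['agency_name'])) + ['<unknown>']

def encode(series, categories):
    """Map names to integer codes; unseen names fall back to '<unknown>'."""
    cat = pd.Categorical(series, categories=categories)
    return pd.Series(cat, index=series.index).fillna('<unknown>').cat.codes

train_demo['agency_code'] = encode(train_demo['agency_name'], categories)
test_demo['agency_code'] = encode(test_demo['agency_name'], categories)
print(test_demo)
```

Using one category list for both frames keeps a given code pointing at the same agency in train and test.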


#### Inspector Names

In [283]:
train_df.inspector_name.value_counts()

Morris, John            11604
Samaan, Neil J           8720
O'Neal, Claude           8075
Steele, Jonathan         6962
Devaney, John            6837
Hayes, Billy J           6385
Sloane, Bennie J         5624
Sims, Martinzie          5526
Zizi, Josue              5060
Doetsch, James           4337
Danielson, Keith D       3880
Gailes, Orbie J          3451
Jones, Leah              3014
Davis, Darlene           2770
Legge, Gerald            2598
Havard, Jacqueline       2435
Sharpe, Anthony          2401
Johnson, Lois            2062
Harris, Rickey           1947
DeRamer, Andrew          1946
Moore, David             1825
Frazier, Willie          1805
Karwowski, Stephen       1786
ELLARD, EVERETT          1759
Watson, Jerry            1674
Addison, Michael         1667
Talbert, Reginald        1633
Matthews, Delos          1618
Williamson, Lillett      1539
Shimko, James            1522
                        ...  
Madrigal, Michael          11
O'Neil, Vincent T          10
Sievers, M

In [290]:
inspectors = train_df.groupby(['agency_name', 'inspector_name'])['compliance'].agg(['size', 'sum', 'mean'])
inspectors.rename(columns={'size': 'count', 
                           'sum': 'compliant', 
                           'mean': 'percent_compliant'}, inplace=True)

inspectors.sort_values('count', ascending=False)

| agency_name | inspector_name | count | compliant | percent_compliant |
|---|---|---|---|---|
| Buildings, Safety Engineering & Env Department | Morris, John | 11604 | 455.0 | 0.039211 |
| Buildings, Safety Engineering & Env Department | Samaan, Neil J | 8720 | 640.0 | 0.073394 |
| Buildings, Safety Engineering & Env Department | O'Neal, Claude | 8075 | 545.0 | 0.067492 |
| Buildings, Safety Engineering & Env Department | Steele, Jonathan | 6962 | 334.0 | 0.047975 |
| Buildings, Safety Engineering & Env Department | Devaney, John | 6837 | 396.0 | 0.057920 |
| Department of Public Works | Hayes, Billy J | 6385 | 491.0 | 0.076899 |
| Buildings, Safety Engineering & Env Department | Sloane, Bennie J | 5624 | 348.0 | 0.061878 |
| Buildings, Safety Engineering & Env Department | Sims, Martinzie | 5526 | 225.0 | 0.040717 |
| Department of Public Works | Zizi, Josue | 4453 | 376.0 | 0.084437 |
| Buildings, Safety Engineering & Env Department | Doetsch, James | 4337 | 286.0 | 0.065944 |


### Disposition

### Violation Code

| | fine_amount | admin_fee | state_fee | late_fee | discount_amount | judgment_amount | lat | lon |
|---|---|---|---|---|---|---|---|---|
| 0 | 200.0 | 20.0 | 10.0 | 20.0 | 0.0 | 250.0 | 42.407581 | -82.986642 |
| 1 | 1000.0 | 20.0 | 10.0 | 100.0 | 0.0 | 1130.0 | 42.426239 | -83.238259 |
| 2 | 100.0 | 20.0 | 10.0 | 10.0 | 0.0 | 140.0 | 42.426239 | -83.238259 |
| 3 | 200.0 | 20.0 | 10.0 | 20.0 | 0.0 | 250.0 | 42.309661 | -83.122426 |
| 4 | 100.0 | 20.0 | 10.0 | 10.0 | 0.0 | 140.0 | 42.30883 | -83.121116 |


## Model Testing
(Decision Tree/Gradient Boosted Classifier/Random Forest, Logistic Regression)

In [409]:
train_df = pd.read_csv('train.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('test.csv', encoding='ISO-8859-1')
addresses = pd.read_csv('addresses.csv')
latlons = pd.read_csv('latlons.csv')  

# Join addresses and latlons.
# ticket_id stays a regular column: merging on 'address' would discard a ticket_id index anyway.
train_df = pd.merge(left=train_df, right=addresses, how='left', on='ticket_id')
train_df = pd.merge(left=train_df, right=latlons, on='address', how='left')
test_df = pd.merge(left=test_df, right=addresses, how='left', on='ticket_id')
test_df = pd.merge(left=test_df, right=latlons, on='address', how='left')

train_drop = ['payment_amount', 
              'payment_date', 
              'payment_status', 
              'balance_due', 
              'collection_status', 
              'compliance_detail']

train_test_drop = ['violator_name',
                   'violation_street_number',
                   'violation_street_name',
                   'violation_zip_code',
                   'grafitti_status',
                   'clean_up_cost',
                   'violation_description',
                   'non_us_str_code',
                   'mailing_address_str_number',
                   'mailing_address_str_name',
                   'city',
                   'state',
                   'zip_code',
                   'country',
                   'address',
                   # optional categoricals/dates
                   'violation_code',
                   'disposition',
                   'agency_name',
                   'inspector_name',
                   'hearing_date',
                   'ticket_issued_date']

# Drop unnecessary cols
train_df.drop(train_drop, axis=1, inplace=True)
train_df.drop(train_test_drop, axis=1, inplace=True)
test_df.drop(train_test_drop, axis=1, inplace=True)
train_df.dropna(how='all', subset=['compliance'], inplace=True)

# Fill missing lat lon values
train_df.fillna(0, inplace=True)
test_df.fillna(0, inplace=True)

# Train Test Split from train_df (save test_df)
features = ['fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount', 'judgment_amount', 'lat', 'lon']
X = train_df[features]
y = train_df['compliance']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model
lr = LogisticRegression().fit(X_train_scaled, y_train)
y_predict_proba = lr.predict_proba(X_test_scaled)
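
The cell above stops at `predict_proba` without scoring the result. A rough sketch of comparing several of the listed model families by held-out AUC is below. It uses synthetic data from `make_classification` as a stand-in, so the printed scores say nothing about the ticket data, and the hyperparameters are arbitrary defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for X and y (not the ticket data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=0)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Column 1 of predict_proba is the probability of the positive class
    auc_scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

for name, score in sorted(auc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```
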



## Final Model

In [435]:
def blight_model():
    
    train_df = pd.read_csv('train.csv', encoding='ISO-8859-1')
    test_df = pd.read_csv('test.csv', encoding='ISO-8859-1')
    addresses = pd.read_csv('addresses.csv')
    latlons = pd.read_csv('latlons.csv')  

    # Join addresses and latlons
    train_df = pd.merge(left=train_df, right=addresses, how='left', left_on='ticket_id', right_on='ticket_id')
    train_df = pd.merge(left=train_df, right=latlons, on='address', how='left')
    
    test_df = pd.merge(left=test_df, right=addresses, how='left', left_on='ticket_id', right_on='ticket_id')
    test_df = pd.merge(left=test_df, right=latlons, on='address', how='left')

    train_drop = ['payment_amount', 
                  'payment_date', 
                  'payment_status', 
                  'balance_due', 
                  'collection_status', 
                  'compliance_detail']

    train_test_drop = ['violator_name',
                       'violation_street_number',
                       'violation_street_name',
                       'violation_zip_code',
                       'grafitti_status',
                       'clean_up_cost',
                       'violation_description',
                       'non_us_str_code',
                       'mailing_address_str_number',
                       'mailing_address_str_name',
                       'city',
                       'state',
                       'zip_code',
                       'country',
                       'address',
                       # optional categoricals/dates
                       'violation_code',
                       'disposition',
                       'agency_name',
                       'inspector_name',
                       'hearing_date',
                       'ticket_issued_date']

    # Drop unnecessary cols
    train_df.drop(train_drop, axis=1, inplace=True)
    train_df.drop(train_test_drop, axis=1, inplace=True)
    test_df.drop(train_test_drop, axis=1, inplace=True)
    train_df.dropna(how='all', subset=['compliance'], inplace=True)

    # Fill missing lat lon values
    train_df.fillna(0.0, inplace=True)
    test_df.fillna(0.0, inplace=True)

    # Train Test Split from train_df (save test_df)
    features = ['ticket_id', 'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount', 'judgment_amount', 'lat', 'lon']
    X_train = train_df[features].set_index('ticket_id')
    X_test = test_df[features].set_index('ticket_id')
    y_train = train_df['compliance']

    # Scale data
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Model
    lr = LogisticRegression().fit(X_train_scaled, y_train)
    y_predict_proba = lr.predict_proba(X_test_scaled)
    result = pd.Series(y_predict_proba[:,1], index=X_test.index)
    
    return result


## Sanity Check

In [436]:
# bm = blight_model()
# res = 'Data type Test: '
# res += ['Failed: type(bm) should Series\n','Passed\n'][type(bm)==pd.Series]
# res += 'Data shape Test: '
# res += ['Failed: len(bm) should be 61001\n','Passed\n'][len(bm)==61001]
# res += 'Data Values Test: '
# res += ['Failed: all values should be in [0.,1.]\n','Passed\n'][all((bm<=1.) & (bm>=0.))]
# res += 'Data Values type Test: '
# res += ['Failed: bm.dtype should be float\n','Passed\n'][str(bm.dtype).count('float')>0]
# res += 'Index type Test: '
# res += ['Failed: type(bm.index) should be Int64Index\n','Passed\n'][type(bm.index)==pd.Int64Index]
# res += 'Index values type Test: '
# res += ['Failed: type(bm.index[0]) should be int64\n','Passed\n'][str(type(bm.index[0])).count("int64")>0]

# res += 'Output index shape test:'
# res += ['Failed, bm.index.shape should be (61001,)\n','Passed\n'][bm.index.shape==(61001,)]

# res += 'Output index test: '
# if bm.index.shape==(61001,):
#     res +=['Failed\n','Passed\n'][all(pd.read_csv('test.csv',usecols=[0],index_col=0).sort_index().index.values==bm.sort_index().index.values)]
# else:
#     res+='Failed'
# print(res)



Data type Test: Passed
Data shape Test: Passed
Data Values Test: Passed
Data Values type Test: Passed
Index type Test: Passed
Index values type Test: Passed
Output index shape test:Passed
Output index test: Passed
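
The commented check above relies on `pd.Int64Index`, which was removed in pandas 2.0. A version-independent sketch of similar checks is below. The helper `check_predictions` is hypothetical, and the demo Series stands in for the output of `blight_model()`.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def check_predictions(series, expected_length):
    """Return a list of problems found in a prediction Series (empty if none)."""
    if not isinstance(series, pd.Series):
        return ['result should be a pandas Series']
    problems = []
    if len(series) != expected_length:
        problems.append(f'length should be {expected_length}')
    if not is_float_dtype(series.dtype):
        problems.append('values should be floats')
    elif not series.between(0.0, 1.0).all():
        problems.append('values should lie in [0, 1]')
    if not is_integer_dtype(series.index.dtype):
        problems.append('index should hold integer ticket ids')
    return problems

# Demo Series standing in for real predictions
demo = pd.Series([0.53, 0.40, 0.11],
                 index=pd.Index([284932, 285362, 285361], name='ticket_id'))
print(check_predictions(demo, expected_length=3))  # prints []
```

For the real submission the call would presumably be `check_predictions(blight_model(), expected_length=61001)`.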

