<a href="https://colab.research.google.com/github/marcelo-guimaraes/Data-Science/blob/master/Detroit_Blight_Violation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Understanding and Predicting Property Maintenance Fines

This notebook is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. The task is to predict whether a given blight ticket will be paid on time.


___


The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, and False if the ticket was paid after the hearing date or not at all. Compliance, as well as a handful of other variables that will not be available at test time, is only included in train.csv.

In [None]:
!pip install scikit-optimize


### Import Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

import folium 

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from skopt import gp_minimize

import imblearn
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

from sklearn.metrics import recall_score, accuracy_score, f1_score, roc_auc_score, precision_score
from sklearn.model_selection import cross_val_score

pd.set_option('display.max_columns', 500)
%matplotlib inline



In [None]:
from google.colab import drive
drive.mount('/content/drive')



### Exploring the Datasets

In [None]:
df_train = pd.read_csv('/content/drive/My Drive/train.csv', encoding = "ISO-8859-1")
df_test = pd.read_csv('/content/drive/My Drive/test.csv', encoding = "ISO-8859-1")
address = pd.read_csv('/content/drive/My Drive/addresses.csv', encoding = "ISO-8859-1")
latlons = pd.read_csv('/content/drive/My Drive/latlons.csv', encoding = "ISO-8859-1")

In [None]:
print("Train Data Shape: {}\n".format(df_train.shape),
      "Test Data Shape: {}".format(df_test.shape))
df_train.head(3)

Train Data Shape: (250306, 34)
 Test Data Shape: (61001, 27)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,state,zip_code,non_us_str_code,country,ticket_issued_date,hearing_date,violation_code,violation_description,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,IL,60606,,USA,2004-03-16 11:40:00,2005-03-21 10:30:00,9-1-36(a),Failure of owner to obtain certificate of comp...,Responsible by Default,250.0,20.0,10.0,25.0,0.0,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,MI,48208,,USA,2004-04-23 12:30:00,2005-05-06 13:30:00,61-63.0600,Failed To Secure Permit For Lawful Use Of Buil...,Responsible by Determination,750.0,20.0,10.0,75.0,0.0,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
2,22062,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","SANDERS, DERRON",1449.0,LONGFELLOW,,23658.0,P.O. BOX,DETROIT,MI,48223,,USA,2004-04-26 13:40:00,2005-03-29 10:30:00,9-1-36(a),Failure of owner to obtain certificate of comp...,Not responsible by Dismissal,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,


In [None]:
df_train.describe()

Unnamed: 0,ticket_id,violation_street_number,violation_zip_code,mailing_address_str_number,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,payment_amount,balance_due,compliance
count,250306.0,250306.0,0.0,246704.0,250305.0,250306.0,250306.0,250306.0,250306.0,250306.0,250306.0,250306.0,250306.0,159880.0
mean,152665.543099,10649.86,,9149.788,374.423435,12.774764,6.387382,21.494506,0.125167,0.0,268.685356,48.898986,222.449058,0.072536
std,77189.882881,31887.33,,36020.34,707.195807,9.607344,4.803672,56.464263,3.430178,0.0,626.915212,222.422425,606.39401,0.259374
min,18645.0,0.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-7750.0,0.0
25%,86549.25,4739.0,,544.0,200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,152597.5,10244.0,,2456.0,250.0,20.0,10.0,10.0,0.0,0.0,140.0,0.0,25.0,0.0
75%,219888.75,15760.0,,12927.25,250.0,20.0,10.0,25.0,0.0,0.0,305.0,0.0,305.0,0.0
max,366178.0,14154110.0,,5111345.0,10000.0,20.0,10.0,1000.0,350.0,0.0,11030.0,11075.0,11030.0,1.0


In [None]:
def stats(df):

    # This function takes a dataset and returns the number of missing values,
    # the cardinality and the type of each column
    
    return pd.DataFrame({'missing':df.isna().sum(),
                          'cardinality': df.nunique(),
                          'type': df.dtypes}).sort_values(by = ['missing','cardinality'], ascending = False)

In [None]:
stats(df_train)

Unnamed: 0,missing,cardinality,type
violation_zip_code,250306,0,float64
grafitti_status,250305,1,object
non_us_str_code,250303,2,object
collection_status,213409,1,object
payment_date,209193,2307,object
compliance,90426,2,float64
hearing_date,12491,6222,object
mailing_address_str_number,3602,15826,float64
state,93,59,object
violator_name,34,119992,object


### Data Processing and Cleansing

In [None]:
# Removing rows with missing values in the target column, and in the column 
# with dates ('hearing_date')
df_train = df_train[(df_train['compliance'] == 0) | (df_train['compliance'] == 1)]
df_train = df_train[~df_train['hearing_date'].isnull()]


# Here I joined the dataframes 'address' and 'latlons' with the train and test data
address = address.set_index('address').join(latlons.set_index('address'), how='left')
df_train = df_train.set_index('ticket_id').join(address.set_index('ticket_id'))
df_test = df_test.set_index('ticket_id').join(address.set_index('ticket_id'))

Below, I remove the columns that exist in the training data but not in the test set. Keeping them would cause data leakage, which happens when the model is trained on information that will not be available at prediction time. Leakage can ruin a machine learning model: the leaked columns may look highly predictive during training, but they will be missing when the model is run on the test set.

To read more about this subject, I recommend [this article](https://machinelearningmastery.com/data-leakage-machine-learning/) from Machine Learning Mastery



In [None]:
df_train.columns[~df_train.columns.isin(df_test.columns)]

Index(['payment_amount', 'balance_due', 'payment_date', 'payment_status',
       'collection_status', 'compliance_detail', 'compliance'],
      dtype='object')

In [None]:
train_remove_list = [
        'balance_due',
        'collection_status',
        'compliance_detail',
        'payment_amount',
        'payment_date',
        'payment_status']

df_train.drop(train_remove_list, axis=1, inplace=True)

In [None]:
## Using the date columns and extracting the month of each fine to use as a feature for the model

# first on test data

df_test.hearing_date.fillna(method='pad', inplace=True)

test_issued_months = []
test_hearing_months = []
for i in df_test.ticket_issued_date:
  test_issued_months.append(i[5:7])
for j in df_test.hearing_date:
  test_hearing_months.append(j[5:7])

df_test['issued_month'] = test_issued_months
df_test['hearing_month'] = test_hearing_months

df_test.issued_month = df_test.issued_month.astype('int8')
df_test.hearing_month = df_test.hearing_month.astype('int8')

In [None]:
# and now on the training set
train_issued_months = []
train_hearing_months = []
for i in df_train.ticket_issued_date:
  train_issued_months.append(i[5:7])
for j in df_train.hearing_date:
  train_hearing_months.append(j[5:7])

df_train['issued_month'] = train_issued_months
df_train['hearing_month'] = train_hearing_months

df_train.issued_month = df_train.issued_month.astype('int8')
df_train.hearing_month = df_train.hearing_month.astype('int8')
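
The month extraction above slices characters 5 to 7 out of each date string. As a hedged alternative, the same feature can be derived by parsing the strings with `pd.to_datetime`. The sketch below uses a small made-up `tickets` dataframe (not the real data) and assumes the dates follow the `YYYY-MM-DD HH:MM:SS` layout seen in the training sample.

```python
import pandas as pd

# Toy stand-in for the ticket data; the values are illustrative only
tickets = pd.DataFrame({
    'ticket_issued_date': ['2004-03-16 11:40:00', '2004-04-23 12:30:00'],
    'hearing_date': ['2005-03-21 10:30:00', None],
})

# Forward-fill missing hearing dates, as the notebook does, before parsing
tickets['hearing_date'] = tickets['hearing_date'].ffill()

# Parse each date column and keep only the month as a small integer
for source, target in [('ticket_issued_date', 'issued_month'),
                       ('hearing_date', 'hearing_month')]:
    tickets[target] = pd.to_datetime(tickets[source]).dt.month.astype('int8')

print(tickets[['issued_month', 'hearing_month']])
```

Parsing the dates avoids relying on fixed character positions and would also make other features, such as the weekday or the gap between issue and hearing, easy to add.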

In [None]:
## Cleaning the mailing_address_str_number column removing all characters and
## filling it with last valid observation

df_train.mailing_address_str_number.fillna(method='pad', inplace=True)
df_test.mailing_address_str_number.fillna(method='pad', inplace=True)


chars_to_remove = df_test[df_test['mailing_address_str_number'].apply(lambda x: not x.isnumeric())]['mailing_address_str_number'].unique()

df_test['mailing_address_str_number'] = df_test['mailing_address_str_number'].replace(chars_to_remove, np.nan)

df_test.mailing_address_str_number.fillna(method='pad', inplace=True)

df_test.mailing_address_str_number = df_test.mailing_address_str_number.astype('float')

In [None]:
## Cleaning the zip_code column removing all characters and
## filling it with last valid observation

# first on training data and then on test set 

chars_to_remove_train = df_train[df_train['zip_code'].apply(lambda x: not str(x).isnumeric())]['zip_code'].unique()

df_train['zip_code'] = df_train['zip_code'].replace(chars_to_remove_train, np.nan)

df_train.zip_code.fillna(method='pad', inplace=True)

df_train.zip_code = df_train.zip_code.astype('float')

#

chars_to_remove_test = df_test[df_test['zip_code'].apply(lambda x: not str(x).isnumeric())]['zip_code'].unique()

df_test['zip_code'] = df_test['zip_code'].replace(chars_to_remove_test, np.nan)

df_test.zip_code.fillna(method='pad', inplace=True)

df_test.zip_code = df_test.zip_code.astype('float')

In [None]:
# Removing some string columns and redundant data
# The criteria for removing a column were: too many missing values, a cardinality
# that is too high, or values that are constant or redundant with other columns.

string_remove_list = ['violation_zip_code', 'grafitti_status', 'non_us_str_code', 'violator_name',
                      'mailing_address_str_name','ticket_issued_date','hearing_date', 'city',
                      'violation_street_name', 'violation_description','violation_code', 'inspector_name',
                      'admin_fee', 'state_fee', 'clean_up_cost']

df_train.drop(string_remove_list, axis=1, inplace=True)
df_test.drop(string_remove_list, axis=1, inplace=True)

In [None]:
stats(df_train)

Unnamed: 0,missing,cardinality,type
state,84,59,object
lon,2,66767,float64
lat,2,61497,float64
violation_street_number,0,18094,float64
mailing_address_str_number,0,14079,float64
zip_code,0,3441,float64
judgment_amount,0,57,float64
fine_amount,0,40,float64
late_fee,0,37,float64
discount_amount,0,12,float64


In [None]:
# Filling missing values in the state, latitude and longitude columns with the last valid observation
df_train.state.fillna(method='pad', inplace=True)
df_train.lat.fillna(method='pad', inplace=True)
df_train.lon.fillna(method='pad', inplace=True)


df_test.state.fillna(method='pad', inplace=True)
df_test.lat.fillna(method='pad', inplace=True)
df_test.lon.fillna(method='pad', inplace=True)

In [None]:
# Dividing training data in target and features columns
y_train = df_train.compliance
train = df_train.drop('compliance', axis = 1)

# Converting categorical data in numerical with the get_dummies function
train_data = pd.get_dummies(train)
test_data = pd.get_dummies(df_test)
train_data, test_data = train_data.align(test_data, join='left', axis=1)

Because the categorical columns contain different sets of values in each dataframe, the get_dummies function creates a different set of dummy columns for the train and test data. To compensate for these 'missing columns', we use the align function, which gives the train and test data the same columns.

In [None]:
test_data.fillna(0, inplace = True)
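
To make the alignment step concrete, here is a minimal sketch on two made-up dataframes (`train` and `test` below are toy data, not the blight tickets). It shows how `join='left'` keeps exactly the training columns and why the test set then needs `fillna(0)`.

```python
import pandas as pd

# Toy data: the test set has a state (OH) that never appears in training,
# and lacks two states (IL, CA) that do
train = pd.DataFrame({'state': ['MI', 'IL', 'CA'], 'fine': [250, 100, 50]})
test = pd.DataFrame({'state': ['MI', 'OH'], 'fine': [300, 75]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# join='left' keeps the training columns only: the test-only column
# (state_OH) is dropped and the missing ones (state_CA, state_IL) become NaN
train_d, test_d = train_d.align(test_d, join='left', axis=1)

# A missing dummy column means "this category is absent", so fill with 0
test_d = test_d.fillna(0)

print(list(train_d.columns))
print(list(test_d.columns))
```

One consequence of this choice is that categories seen only in the test set are silently ignored by the model.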

In [None]:
print('Number of columns on Train Data: {}'.format(train_data.shape[1]),
      '\nNumber of columns on Test Data : {}'.format(test_data.shape[1]))
train_data.head()

Number of columns on Train Data: 84 
Number of columns on Test Data : 84


ticket_id,violation_street_number,mailing_address_str_number,zip_code,fine_amount,late_fee,discount_amount,judgment_amount,lat,lon,issued_month,hearing_month,"agency_name_Buildings, Safety Engineering & Env Department",agency_name_Department of Public Works,agency_name_Detroit Police Department,agency_name_Health Department,agency_name_Neighborhood City Halls,state_AK,state_AL,state_AR,state_AZ,state_BC,state_BL,state_CA,state_CO,state_CT,state_DC,state_DE,state_FL,state_GA,state_HI,state_IA,state_ID,state_IL,state_IN,state_KS,state_KY,state_LA,state_MA,state_MD,state_ME,state_MI,state_MN,state_MO,state_MS,state_MT,state_NB,state_NC,state_ND,state_NH,state_NJ,state_NM,state_NV,state_NY,state_OH,state_OK,state_ON,state_OR,state_PA,state_PR,state_QC,state_QL,state_RI,state_SC,state_SD,state_TN,state_TX,state_UK,state_UT,state_VA,state_VI,state_VT,state_WA,state_WI,state_WV,state_WY,country_Aust,country_Cana,country_Egyp,country_Germ,country_USA,disposition_Responsible (Fine Waived) by Deter,disposition_Responsible by Admission,disposition_Responsible by Default,disposition_Responsible by Determination
22056,2900.0,3.0,60606.0,250.0,25.0,0.0,305.0,42.390729,-83.124268,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
27586,4311.0,2959.0,48208.0,750.0,75.0,0.0,855.0,42.326937,-83.135118,4,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
22046,6478.0,2755.0,908041512.0,250.0,25.0,0.0,305.0,42.145257,-83.208233,5,3,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
18738,8027.0,476.0,48038.0,750.0,75.0,0.0,855.0,42.433466,-83.023493,6,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
18735,8228.0,8228.0,48211.0,100.0,10.0,0.0,140.0,42.388641,-83.037858,6,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0


And that's it! The data is now ready for the machine learning model.

### Machine Learning Model

In [None]:
from sklearn.metrics import fbeta_score

# Creating a metric function that uses the f2 score

def f2_score(y_true, y_pred):
    # fbeta_score throws a confusing error if inputs are not numpy arrays
    y_true, y_pred, = np.array(y_true), np.array(y_pred)

    return fbeta_score(y_true.transpose(), y_pred.transpose(), beta=2, average="macro")

The intuition behind the F2 score is that it weights recall higher than precision. This makes the F2 score more suitable in certain applications where it’s more important to classify correctly as many positive samples as possible, rather than maximizing the number of correct classifications.
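
The weighting can be seen directly in the general F-beta formula, $F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, where $P$ is precision and $R$ is recall. The sketch below evaluates it for a single class with made-up precision and recall values (0.75 and 0.40 are illustrative, not results from this notebook).

```python
def fbeta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier: fairly precise, but it misses many positives
precision, recall = 0.75, 0.40

f1 = fbeta(precision, recall, beta=1)
f2 = fbeta(precision, recall, beta=2)

# F2 is pulled toward the (lower) recall, so it ends up below F1 here
print(round(f1, 4), round(f2, 4))
```

Note that the `f2_score` helper in this notebook calls `fbeta_score` with `average="macro"`, which averages the score over both classes, so its values are not directly comparable to a positive-class figure like the one computed here.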

In [None]:
# Here I created a function that trains a RandomForest classifier on my data 
# and returns to me the 15 most important features for the model

def feature_importances(X, y):
    
    '''
    X : Features columns
    y : Target column
    
    '''
    
    # feature extraction
    model = RandomForestClassifier(n_estimators=10)
    model.fit(X, y)

    feature_importances = pd.DataFrame(model.feature_importances_,
                                       index = X.columns,
                                       columns=['importance']).sort_values('importance', ascending=False)
    return feature_importances.iloc[0:15]

In [None]:
# Build And Train Classifier Model
classifiers = [MLPClassifier(hidden_layer_sizes = [30, 30],alpha=0.01,random_state = 0, solver='lbfgs', verbose=0),
               KNeighborsClassifier(n_neighbors = 5, weights = 'distance', p = 2),
               AdaBoostClassifier(random_state = 0),
               RandomForestClassifier(random_state = 0)
              ]

def test_classifiers(classifiers, X, y):
    
    # This function takes a list of classifiers, trains each one, and prints its ROC AUC,
    # recall, precision, F1 and F2 scores, so I can decide which algorithm to proceed with
    
    """
    classifiers: List of classifiers you want to train
    X          : Features columns
    y          : Target column
    
    """

    X_train, X_test, y_train, y_test = train_test_split(X, y)


    for clf in classifiers:
        clf.fit(X_train, y_train)
        #y_pred = clf.predict(X_test)

        threshold = 0.4
        y_proba = clf.predict_proba(X_test)[:,1]
        y_pred = (y_proba >= threshold).astype('int')

        name = clf.__class__.__name__

        print('='*30)
        print(name)

        print('****Result****')
        roc = roc_auc_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        f2 = f2_score(y_test, y_pred)

        print(" Roc Auc Score: {}".format(roc),
              "\n Recall Score: {}".format(recall),
              "\n Precision Score: {}".format(precision),
              "\n F1 Score: {}".format(f1),
              "\n F2 Score: {}".format(f2))

In [None]:
test_classifiers(classifiers, train_data, y_train)

MLPClassifier
****Result****
 Roc Auc Score: 0.4935768590005528 
 Recall Score: 0.06926863572433192 
 Precision Score: 0.06078370873187288 
 F1 Score: 0.06474938373048479 
 F2 Score: 0.49362130168339174
KNeighborsClassifier
****Result****
 Roc Auc Score: 0.5909316847246457 
 Recall Score: 0.2109704641350211 
 Precision Score: 0.357355568790947 
 F1 Score: 0.26531063453460096 
 F2 Score: 0.597312789352954
AdaBoostClassifier
****Result****
 Roc Auc Score: 0.5 
 Recall Score: 1.0 
 Precision Score: 0.07125319436789096 
 F1 Score: 0.13302773749941532 
 F2 Score: 0.1386235133554299
RandomForestClassifier
****Result****
 Roc Auc Score: 0.6861549138987821 
 Recall Score: 0.3829113924050633 
 Precision Score: 0.7348178137651822 
 F1 Score: 0.5034674063800277 
 F2 Score: 0.7028264408534982


OK, here we can verify two things. First, the best model for this data was the Random Forest. Second, the model returned a poor recall score, which may be because the data is imbalanced.

To try to solve this issue, I'll use the package [imblearn](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) to balance the data 

#### Balancing the Dataset

In [None]:
Xtrain_for_balance, Xval_desbalanced, ytrain_for_balance, yval_desbalanced = train_test_split(train_data, y_train)

In [None]:
# And, as we can verify, most part of the data belongs to the class 
# that won't pay the fines on time
ytrain_for_balance.value_counts()

0.0    111157
1.0      8582
Name: compliance, dtype: int64

In [None]:
oversample = SMOTE()
X_sampled, y_sampled = oversample.fit_resample(Xtrain_for_balance, ytrain_for_balance)

In [None]:
pd.Series(y_sampled).value_counts()

1.0    111157
0.0    111157
dtype: int64

And, that's it. Now the classes are balanced and we can already train the model!
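
For intuition, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and placing a new point somewhere on the line segment between them. The sketch below illustrates that idea with NumPy on four made-up points; `smote_like_sample` is a hypothetical helper written for this illustration, not imblearn's implementation.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a near neighbour."""
    rng = np.random.default_rng(rng)
    base = minority[rng.integers(len(minority))]
    # Distances from the chosen point to every minority point
    dists = np.linalg.norm(minority - base, axis=1)
    # Indices of the k nearest neighbours, skipping the closest entry
    # (the point itself, at distance 0)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    # Random position on the segment between the point and its neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)

# Toy minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority, k=2, rng=0)
print(synthetic)
```

Because the new points are interpolations, they stay inside the region already covered by the minority class rather than being exact copies of existing rows.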

---

Since we could see that the RandomForest got a better performance, I'll proceed with it

In [None]:
# Creating the model using a 0.4 threshold for classification

clf = RandomForestClassifier(random_state = 0)
clf.fit(X_sampled, y_sampled)

threshold = 0.4
y_proba = clf.predict_proba(Xval_desbalanced)[:,1]
y_pred = (y_proba >= threshold).astype('int')

#y_pred = clf.predict(Xval_desbalanced)

roc = roc_auc_score(yval_desbalanced, y_pred)
recall = recall_score(yval_desbalanced, y_pred)
precision = precision_score(yval_desbalanced, y_pred)
f1 = f1_score(yval_desbalanced, y_pred)
f2 = f2_score(yval_desbalanced, y_pred)

print(" Roc Auc Score: {}".format(roc),
      "\n Recall Score: {}".format(recall),
      "\n Precision Score: {}".format(precision),
      "\n F2 Score: {}".format(f2))

 Roc Auc Score: 0.6958387731758836 
 Recall Score: 0.4117234117234117 
 Precision Score: 0.6122129436325678 
 F2 Score: 0.7078126475317373


As we can see, the model did not improve much. So I'll try to improve it a little more with hyperparameter tuning. But first, I'll use the function I created to see the 15 most important features in the dataframe.

In [None]:
Xtrain = pd.DataFrame(X_sampled)
Xtrain.columns = train_data.columns
feature_importances(Xtrain, y_sampled)

Unnamed: 0,importance
disposition_Responsible by Default,0.16381
judgment_amount,0.088127
disposition_Responsible by Admission,0.078977
disposition_Responsible by Determination,0.075979
hearing_month,0.074632
late_fee,0.070107
"agency_name_Buildings, Safety Engineering & Env Department",0.049164
issued_month,0.04881
violation_street_number,0.045797
mailing_address_str_number,0.04438


I looked at the importances because some feature engineering might improve the features. For now, though, I'll proceed with the hyperparameter tuning:

In [None]:
def train_model(params):
    
    n_estimators = params[0]
    min_samples_split = params[1]
    
    print(params, '\n')
    
    clf = RandomForestClassifier(n_estimators = n_estimators, min_samples_split = min_samples_split, random_state = 0)
    
    # cross_val_score clones and fits the classifier on each fold, so no separate fit is needed.
    # gp_minimize minimizes its objective, so the mean recall is negated.
    return -cross_val_score(clf, Xtrain, y_sampled, cv = 2, scoring = 'recall').mean()



space = [(150, 1200),        # n_estimators
         (2, 100)]           # min_samples_split

In [None]:
result = gp_minimize(train_model, space, random_state = 1, verbose = 1, n_calls = 10, n_random_starts = 2)

Iteration No: 1 started. Evaluating function at random point.
[1197, 93] 

Iteration No: 1 ended. Evaluation done at random point.
Time taken: 867.7314
Function value obtained: -0.7849
Current minimum: -0.7849
Iteration No: 2 started. Evaluating function at random point.
[285, 100] 

Iteration No: 2 ended. Evaluation done at random point.
Time taken: 217.2221
Function value obtained: -0.7836
Current minimum: -0.7849
Iteration No: 3 started. Searching for the next optimal point.
[169, 2] 

Iteration No: 3 ended. Search finished for the next optimal point.
Time taken: 151.1331
Function value obtained: -0.8747
Current minimum: -0.8747
Iteration No: 4 started. Searching for the next optimal point.
[1200, 2] 

Iteration No: 4 ended. Search finished for the next optimal point.
Time taken: 1024.9138
Function value obtained: -0.8740
Current minimum: -0.8747
Iteration No: 5 started. Searching for the next optimal point.
[150, 2] 

Iteration No: 5 ended. Search finished for the next optimal poin

In [None]:
result.x

[153, 2]

Well, after a few hours of tuning different parameters, I could see that the default parameters already do a good job and that improving the model through hyperparameter tuning would be very computationally expensive. I'll leave this part in the notebook anyway, because with the other models I tested the tuning helped a lot and was much faster than with the Random Forest.

### Reporting Findings

In [None]:
# Training again the model with the parameters I found
clf = RandomForestClassifier(n_estimators=153, min_samples_split=2,random_state=0)

#clf = AdaBoostClassifier(n_estimators=793, learning_rate = 0.1,random_state=0)


clf.fit(X_sampled, y_sampled)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=153,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [None]:
# Creating a dataframe with lat, lon and ticket data from the testset

# test_data is already indexed by ticket_id, so the index carries over
report_df = test_data[['judgment_amount', 'lon', 'lat']].copy()

In [None]:
# Making predictions on the test and getting the probability 
report_df['compliance'] = clf.predict_proba(test_data)[:,1]

#report_df['compliance'] = clf.predict(test_data)

In [None]:
report_df

ticket_id,judgment_amount,lon,lat,compliance
284932,250.0,-82.986642,42.407581,0.019608
285362,1130.0,-83.238259,42.426239,0.013072
285361,140.0,-83.238259,42.426239,0.130719
285338,250.0,-83.122426,42.309661,0.052288
285346,140.0,-83.121116,42.308830,0.091503
...,...,...,...,...
376496,1130.0,-83.140869,42.376675,0.019608
376497,1130.0,-83.140869,42.376675,0.019608
376499,140.0,-82.992015,42.409430,0.052288
376500,140.0,-82.991747,42.409525,0.026144


In [None]:
# Here I classify the probabilities of compliance in 3 groups:

# 0 -> Probably won't pay the fines on time
# 1 -> Some probability of not paying on time
# 2 -> Probably will pay on time

risk = []
risk_char = []

for x in report_df.compliance:
  if x <= 0.4:
    risk.append(0)
    risk_char.append('High')
  elif x > 0.4 and x < 0.7:
    risk.append(1)
    risk_char.append('Medium')
  else:
    risk.append(2)  
    risk_char.append('Low')  
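
The same three-way split can also be written without an explicit loop. The sketch below is one possible vectorized version using `np.select`, with the same cut-offs as the loop above (0.4 and 0.7). The `risk_groups` helper and the sample probabilities are made up for illustration.

```python
import numpy as np

def risk_groups(proba):
    """Map compliance probabilities to risk codes (0, 1, 2) and labels."""
    proba = np.asarray(proba, dtype=float)
    # Conditions are checked in order, so the second one only applies
    # to values above 0.4; everything from 0.7 upward falls to the default
    codes = np.select([proba <= 0.4, proba < 0.7], [0, 1], default=2)
    labels = np.array(['High', 'Medium', 'Low'])[codes]
    return codes, labels

codes, labels = risk_groups([0.02, 0.4, 0.55, 0.7, 0.78])
print(codes.tolist(), labels.tolist())
```

With a dataframe like `report_df`, the two returned arrays could be assigned straight to the `risk` and `risk_char` columns.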

In [None]:
report_df['risk'] = risk
report_df['risk_char'] = risk_char
report_df.tail()

ticket_id,judgment_amount,lon,lat,compliance,risk,risk_char
376496,1130.0,-83.140869,42.376675,0.019608,0,High
376497,1130.0,-83.140869,42.376675,0.019608,0,High
376499,140.0,-82.992015,42.40943,0.052288,0,High
376500,140.0,-82.991747,42.409525,0.026144,0,High
369851,80.0,-83.12074,42.349152,0.784314,2,Low


In [None]:
report_df.risk_char.value_counts().sort_index()

High      54851
Low        3649
Medium     2501
Name: risk_char, dtype: int64

As we can see, most tickets fall into the high-risk group, meaning they are unlikely to be paid on time. Let's see how much money this represents in potential non-payment.

In [None]:
print('Total Fines and Fees that probably would not be paid: ${}'.format(round(report_df['judgment_amount'][report_df['risk'] == 0].sum(), 2)))

Total Fines and Fees that probably would not be paid: $18968671.1


That is, if all these people do not actually pay, the loss will be $ 18,968,671.10


These are only the tickets with a high chance of non-payment. If we add the tickets with a moderate chance, the potential loss increases even more:

In [None]:
print('Total Fines and Fees that probably would not be paid: ${}'.format(round(report_df['judgment_amount'][report_df['risk'] != 2].sum(), 2)))

Total Fines and Fees that probably would not be paid: $20604701.3


Then, the loss may increase further, to around $ 20,604,701.30

These are the main descriptions of violations by those who are unlikely to pay fines on time:

In [None]:
test_raw = pd.read_csv('/content/drive/My Drive/test.csv', encoding = "ISO-8859-1")
test_raw.set_index('ticket_id', inplace = True)

In [None]:
test_raw['violation_description'][report_df['risk'] == 0].value_counts()[:10]

Excessive weeds or plant growth one- or two-family dwelling or commercial Building                                                         15126
Allowing bulk solid waste to lie or accumulate on or about the premises                                                                    13811
Failure of owner to obtain certificate of compliance                                                                                        8076
Violation of time limit for approved containers to remain at curbside - early or late                                                       2570
Inoperable motor vehicle(s) one- or two-family dwelling or commercial building                                                              1757
Failure to obtain certificate of registration for rental property                                                                           1553
Failure to maintain a vacant building or structure in accordance with the requirements of Section 9-1-113 of the Detroit City Code

Here, I list the name, address, and postal code of those who are unlikely to pay their fines, so that actions can be taken to increase the chance of payment.

In [None]:
cols = ['violator_name', 'judgment_amount', 'violation_street_number', 'violation_street_name',
        'zip_code', 'city']

test_raw[cols][report_df['risk'] == 0]

ticket_id,violator_name,judgment_amount,violation_street_number,violation_street_name,zip_code,city
284932,"FLUELLEN, JOHN A",250.0,10041.0,ROSEBERRY,48213,DETROIT
285362,"WHIGHAM, THELMA",1130.0,18520.0,EVERGREEN,48219,DETROIT
285361,"WHIGHAM, THELMA",140.0,18520.0,EVERGREEN,48219,DETROIT
285338,"HARABEDIEN, POPKIN",250.0,1835.0,CENTRAL,48183,WOODHAVEN
285346,"CORBELL, STANLEY",140.0,1700.0,CENTRAL,48154,LIVONIA
...,...,...,...,...,...,...
376483,NPML Mortgage Acquistion LLC c/o Home Servicing,305.0,18827.0,KLINGER,70810,Baton Rouge
376496,THE AIC GROUP,1130.0,12032.0,SANTA ROSA,48037,Southfield
376497,THE AIC GROUP,1130.0,12032.0,SANTA ROSA,48037,Southfield
376499,"BARLOW, CHRISTOPHER D",140.0,11832.0,KILBOURNE,48213,DETROIT


And finally, I used the Folium package to create a map to show some occurrences of the test data with their locations, fine amount and risk of not paying

In [None]:
import folium

loc = [report_df['lat'].iloc[0]-0.03,report_df['lon'].iloc[0]-0.1]

m = folium.Map(location=loc, zoom_start=12)

# Marker color for each risk group: 0 = High, 1 = Medium, 2 = Low
colors = {0: 'crimson', 1: 'orange', 2: 'seagreen'}

# Plot only the first 6000 tickets
sample = report_df.head(6000)

for lat, lon, comp, risk_label, risk_code, amount in zip(sample['lat'], sample['lon'], sample['compliance'], sample['risk_char'], sample['risk'], sample['judgment_amount']):

  label = folium.Popup(str(risk_label) +' Risk' + '\n Percentage:' + str(round(1 - comp, 2)) + '\n Amount: $' + str(amount))

  folium.CircleMarker(
      [lat,lon],
      radius = amount / 120,
      popup = label,
      color = colors[risk_code],
      fill = True, 
      fill_color = colors[risk_code],
      fill_opacity = 0.6,
      parse_html = False).add_to(m)

m

### Conclusions

To conclude and summarize this project, I believe the model did a good job of flagging the people who are unlikely to pay on time, although its recall on the validation set (about 0.41) leaves room for improvement. Using the dataframe shown above the map, the city could take the name, address, and postal code of these residents and approach them differently, for example by charging them in another way or by offering payment promotions. An A/B test would probably be the best way to find out which option works. If the city adopted good strategies to reach these people and converted even part of the fines at risk into payments, it could recover a few million dollars more.